Supplemental files and examples for this book can be found at http://examples.oreilly.com/0636920021483/. Please use a standard desktop web browser to access these files, as they may not be accessible from all ereader devices.
All code files or examples referenced in the book will be available online. For physical books that ship with an accompanying disc, whenever possible, we’ve posted all CD/DVD content. Note that while we provide as much of the media content as we are able via free download, we are sometimes limited by licensing restrictions. Please direct any questions or concerns to booktech@oreilly.com.
The Portable Document Format (PDF) is the world’s leading page description language, and the first format equally useful for print and online use.
PDF documents are now almost ubiquitous in the printing industry, in document interchange, and in the online distribution of paginated content. They are, however, widely viewed as opaque and delicate and are poorly understood, even by those of a technical disposition.
This is partly due to a perplexing lack of documentation; the file format reference is freely available, but is of a size and complexity which requires a time investment unlikely to be plausible for the majority of those working with PDF.
This book aims to be an approachable introduction. It is suitable both for the technically minded and for those who just want to understand a little of the PDF format to give context to their work with tools which produce or process PDF documents.
We’ve tried to write a book which serves as a general introduction, with some optional technical interludes, giving you the chance to type in example PDF files and see how they display.
This book is suitable for:
Adobe Acrobat users who want to understand the reasons behind the facilities it provides, rather than just how to use them. For example: encryption options, trim and crop boxes, and page labels.
Power users who want to use command-line software to process PDF documents in batches by merging, splitting, and optimizing them.
Programmers writing code to read, edit, or create PDF files.
Industry professionals in search, electronic publishing, and printing who want to understand how to use PDF’s metadata and workflow features to build coherent systems.
In this chapter, we give a history of the PDF format and put it into context. We look at the advantages PDF has over similar technologies, introduce specialized kinds of PDF files such as PDF/X and PDF/A, and take a brief tour of the elements which comprise a typical PDF document. We conclude by looking at how PDF is used in industry.
We begin in earnest, building a simple PDF file from scratch in a text editor. We show how to process this into a fully valid PDF and open it in a PDF viewer. We explain each component of the file, taking our first look at various parts of the PDF syntax.
In this chapter, we describe the layout and content of a PDF file, and the syntax of the objects from which it is built. We describe how a PDF document is read from a flat file into a structured format and, conversely, written from that structured format to a flat file.
In this chapter, we leave behind the bits and bytes of the PDF file, and consider the logical structure of its objects, describing how pages and their resources are arranged into a document.
We describe how to create vector graphics and raster images in PDF, and how to deal with transparency, color spaces, and patterns. We illustrate with examples, showing the code and the result in a PDF viewer.
In this chapter, we look at the PDF operators for building and showing text strings using different fonts and sizes, and how to build lines and paragraphs. We describe the different types of fonts and encodings in PDF documents, and how they are defined and used. We look at the process of text extraction from a PDF document.
Here, we discuss topics not directly related to the visual appearance of the document, but to ancillary data: bookmarks, metadata, hyperlinks, annotations, and file attachments. For each, we describe how they are defined in PDF and give examples.
We look at how encryption and document permissions work in PDF, and see how to inspect encryption information in Adobe Reader. We describe how programs which process PDF files read, write, and edit encrypted documents.
In this chapter, we show how to use the popular pdftk program for the command-line processing of PDF files, looking at common usage scenarios. We describe what a program such as pdftk has to do internally to achieve certain tasks (for example, merging or splitting documents).
Here, we describe both Adobe and open-source software for viewing, converting, editing, and programming with PDF files. We give sources of further documentation and other resources such as support and discussion forums.
I should like to thank my editor, Simon St.Laurent, who was enthusiastic about this project from the beginning.
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.
This icon indicates a warning or caution.
All the PDF code examples in this book are available for download in a zip archive from the O’Reilly website. The text of the book contains enough information to reconstruct these examples (with the exception of encrypted documents, which are not suitable for typing in manually).
The examples include the PDF source for the figures in this book.
This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “PDF Explained by John Whitington (O’Reilly). Copyright 2012 John Whitington, 978-1-449-31002-8.”
If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.
Safari Books Online is an on-demand digital library that lets you easily search over 7,500 technology and creative reference books and videos to find the answers you need quickly.
With a subscription, you can read any page and watch any video from our library online. Read books on your cell phone and mobile devices. Access new titles before they are available for print, and get exclusive access to manuscripts in development and post feedback for the authors. Copy and paste code samples, organize your favorites, download chapters, bookmark key sections, create notes, print out pages, and benefit from tons of other time-saving features.
O’Reilly Media has uploaded this book to the Safari Books Online service. To have full digital access to this book and others on similar topics from O’Reilly and other publishers, sign up for free at http://my.safaribooksonline.com.
Please address comments and questions concerning this book to the publisher:
| O’Reilly Media, Inc. |
| 1005 Gravenstein Highway North |
| Sebastopol, CA 95472 |
| 800-998-9938 (in the United States or Canada) |
| 707-829-0515 (international or local) |
| 707-829-0104 (fax) |
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at:
| http://oreilly.com/catalog/0636920021483 |
To comment or ask technical questions about this book, send email to:
| bookquestions@oreilly.com |
For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
The Portable Document Format (PDF) is the world’s leading language for describing the printed page, and the first one equally suitable for paper and online use. In this chapter, we take a tour of its uses, features, and history. We look at some useful free software and resources, some of which we’ll use later in this book.
Today we take the high fidelity exchange of documents for granted, knowing that a document sent here will appear the same there and vice versa, and that it may be displayed equally on screen and on paper. This was not always so.
We could pass documents between users, and from user to printer, as a series of bitmap pictures (e.g., TIFF or PNG), one for each page. However, this doesn’t allow for any structure to be retained, precludes scaling to different paper sizes or resolutions without loss of quality, involves huge file sizes, and so on.
A page description language like PDF is a way of describing the contents (text and graphics) of a printed or onscreen page using highly structured data, often with extra metadata describing various aspects of the document (such as printing information, textual annotations, or how it is to be viewed or printed). This way, decisions about how the document is rasterized (converted to pixels by a printer or on screen) can be left until the end of the production process. A PDF file can contain text and associated font definitions, vector and bitmap graphics, navigation (such as hyperlinks and bookmarks), and interactive forms.
PDF is used wherever the exact presentation of the content is important (for example, for a print advertisement or book). It isn’t normally suitable when the content is to be laid out or reflowed at the last moment, such as in a variable-width web page; languages like HTML and CSS, which separate content from presentation, are more suitable in those circumstances.
Many page description languages were created when the printing of lines of text in fixed fonts began to be replaced by digital graphics printing. The printer would then process the language to generate a bitmap at the appropriate resolution. For example, PostScript (Adobe), PCL (Hewlett Packard), and KPDL (Kyocera). Simpler languages were used for vector plotters (for example, HPGL from Hewlett Packard).
These languages varied in complexity and functionality. PostScript files, for example, are full programs—the result of executing the program is the document’s visual representation. These languages often contain extra instructions to control aspects of the document other than the page content, for example which tray paper is drawn from or whether the output is to be duplexed.
PDF began as an internal project at Adobe to create a platform-neutral method for document interchange. PostScript was already popular in the print community, but wasn’t practical for on screen use with the computers of the day—especially for random access (to render page 50 of a PostScript document, one must process pages 1–49 first). The idea was to use a subset of the PostScript graphics language together with ancillary data to create a structured language for standalone documents to be viewed on (or printed from) any computer.
PDF 1.0 was announced in 1993, with Acrobat Distiller (for creating and editing PDF files) and Acrobat Reader (for viewing only), both as paid-for programs. The US Tax Authorities started to ship tax forms as PDFs, purchasing a license to allow their users to download Acrobat Reader for free. Later on, Acrobat Reader was made available to everybody at no cost, leading to the widespread use of PDF for the exchange of documents online.
Over the next 10 years, after a slow start as prepress features were added, PDF overtook PostScript as the language of choice in the printing industry. Today, it is the only general page description language of note.
When a number of formats compete to be the industry standard, the best contender is not always the victor—luck can intervene. In this case, though, PDF had a number of singular advantages. We look at some of them here.
Unlike PostScript, any object (page, graphic, etc.) in a PDF document can be accessed at will, in constant time. This means it’s no harder to read page 150 than page 1. Linearization is the process of arranging the objects in the file such that all those needed for a given page are located in adjacent positions. This explains why you can quickly jump to any page in a PDF being viewed in Acrobat Reader in a web browser window: the viewer doesn’t need to load the whole file to begin with; it fetches from the server just the sections needed to display each new page.
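To make the idea of constant-time access concrete, here is a minimal Python sketch of a toy file with an index of byte offsets. It is only an illustration of the principle, not the PDF file layout: the names `build_file` and `read_object` and the record format are our own assumptions.

```python
import io

def build_file(objects):
    """Write objects back to back; return the bytes and an index of (offset, length)."""
    buf = io.BytesIO()
    index = []
    for body in objects:
        index.append((buf.tell(), len(body)))
        buf.write(body)
    return buf.getvalue(), index

def read_object(stream, index, n):
    """Fetch object n with one seek and one read, without scanning earlier objects."""
    offset, length = index[n]
    stream.seek(offset)
    return stream.read(length)

# A toy "document" of 200 pages.
pages = [f"page {i} content".encode() for i in range(1, 201)]
data, index = build_file(pages)
stream = io.BytesIO(data)
page_150 = read_object(stream, index, 149)  # zero-based position in the index
```

Reading page 150 costs the same single lookup as reading page 1, which is the property the paragraph above describes.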
Stream creation is the ability inherent in the PDF format to allow files to be created in order, from beginning to end, even if the eventual file is larger than the memory available.
Incremental update means that, when editing a file, it’s possible to write the changes to the end of the file without modifying any existing part—this makes saving changed versions very fast, and can be used to provide an undo mechanism (since the previous version is still intact).
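The following Python sketch models incremental update on plain bytes. It is a simplified illustration under our own assumptions (the function names and the byte contents are hypothetical, and no real PDF syntax is involved): saving appends to the end, and undoing truncates back to the previous length.

```python
def incremental_save(file_bytes, update):
    """Append an update; return the new bytes and the length of the previous version."""
    return file_bytes + update, len(file_bytes)

def undo(file_bytes, previous_length):
    """Recover the previous version by dropping everything appended after it."""
    return file_bytes[:previous_length]

original = b"version 1 objects\n"
edited, mark = incremental_save(original, b"version 2 changes\n")
restored = undo(edited, mark)
```

Because the earlier bytes are never rewritten, the cost of a save depends on the size of the change rather than the size of the file.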
Fonts used in a PDF are embedded along with the document. This means that it should always be rendered correctly, regardless of which fonts are installed on a given computer. The program creating the PDF document will remove unnecessary data from the font (such as metrics and unused characters), so the file does not become unduly large. PDF supports all common font formats, such as TrueType and Type 1.
Most PDF files maintain the information to map the character shapes making up the text to Unicode character codes. This means that you can copy and paste text from a document, or search the text easily. More recent developments in PDF allow the logical order of the text in the document to be stored separately from the layout of the text on the page, preserving yet more structured information.
PDF was released as an open standard by the International Organization for Standardization (ISO) in 2008. The ISO 32000-1:2008 document is largely the same as the PDF file format document previously released by Adobe.
This independence lends legitimacy and oversight to the PDF standard, which should encourage its further adoption. However, with no real tools for detecting whether a file meets the standard (Adobe Reader will happily load malformed files, so many tools create them), genuine rigor is some time away.
There are several specialized variations on the PDF format—both standardized, and in development. These are subsets of the PDF format. Each file is a valid PDF document, but with restrictions on the facilities used or the content itself. Two of these, PDF/A and PDF/X, are now ISO standards.
The PDF/A Standard (ISO 19005-1:2005) defines a set of rules for documents intended for long-term archiving in libraries, national archives and bureaucracies. It also requires a “conforming reader” to act in certain ways, using the embedded fonts, using color management, and so forth. Briefly, the restrictions on PDF/A are:
No encryption
All fonts to be embedded
Metadata is required
JavaScript is disallowed
Device-independent color spaces only
No audio or video content
There are two levels of PDF/A compliance: PDF/A-1b (“level B compliance”) requires exact visual reproduction of the document. PDF/A-1a (“level A compliance”) requires that text can be mapped to Unicode, and that the order and structure of the text is documented, in addition to the requirement of exact visual reproduction.
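The PDF/A restrictions listed above can be read as a checklist. The Python sketch below applies them to a summary of a document's properties. The dictionary keys and the function name are hypothetical, chosen for illustration; a real validator would have to derive these facts from the file itself, and the sketch does not cover the extra level A requirements.

```python
def pdfa_violations(doc):
    """Return the listed PDF/A restrictions that a document summary breaks."""
    problems = []
    if doc.get("encrypted"):
        problems.append("encryption is not allowed")
    if not doc.get("all_fonts_embedded"):
        problems.append("all fonts must be embedded")
    if not doc.get("has_metadata"):
        problems.append("metadata is required")
    if doc.get("has_javascript"):
        problems.append("JavaScript is disallowed")
    if not doc.get("device_independent_color_only"):
        problems.append("only device-independent color spaces are allowed")
    if doc.get("has_audio_or_video"):
        problems.append("audio and video content is not allowed")
    return problems

# A hypothetical document summary that breaks one rule.
summary = {
    "encrypted": False,
    "all_fonts_embedded": True,
    "has_metadata": True,
    "has_javascript": True,
    "device_independent_color_only": True,
    "has_audio_or_video": False,
}
report = pdfa_violations(summary)
```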
The PDF/A Competence Center is an industry group representing PDF/A stakeholders. A second ISO version of PDF/A is in preparation.
The PDF/X Standard is a family of ISO standards for graphics exchange in the printing industry, the latest of which is PDF/X-5 (ISO 15930-8:2010). It defines a number of restrictions:
All fonts must be embedded
All image data must be embedded
Cannot contain sound, films or non-printable annotations
No forms
No JavaScript
Limited compression algorithms
No encryption
and a number of extra requirements:
The file is marked as PDF/X with the subversion (e.g., PDF/X-5)
Bleed, trim and/or art boxes are required, in addition to the normal page size. These boxes define the size of the media, the printable area, the final cut size, and so on.
A flag is set if the file has been trapped. Trapping is the process of creating small overlaps between graphical objects to mask registration problems in multiple color printing processes.
The file must contain an output intent, containing a color profile describing how it is to be printed.
PDF is fully backward compatible (you can load a PDF version 1.0 document into a program designed for PDF 1.7) and mostly forward compatible (programs written for PDF 1.0 can normally load PDF 1.7 files). Forward compatibility is ensured because readers ignore content they don’t understand—it’s only when new compression methods or object storage mechanisms are introduced that this may be broken. Since PDF 1.5 in 2003, such changes have been minimal. PDF versions and their features are summarized in Table 1-1.
Table 1-1. Functionality in PDF versions 1.0 to 1.7 Extension Level 8
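| PDF version | Acrobat Reader version | Released | Summary of new features |
|---|---|---|---|
| 1.0 | 1.0 | 1993 | First release. |
| 1.1 | 2.0 | 1996 | Device-independent color spaces, encryption (40-bit), article threads, named destinations, and hyperlinks. |
| 1.2 | 3.0 | 1996 | AcroForms (interactive forms), movies and sounds, more compression methods, Unicode support. |
| 1.3 | 4.0 | 2000 | More color spaces, embedded (attached) files, digital signatures, annotations, masked images, gradient fills, logical document structure, prepress support. |
| 1.4 | 5.0 | 2001 | Transparency, 128-bit encryption, better form support, XML metadata streams, tagged PDF, JBIG2 compression. |
| 1.5 | 6.0 | 2003 | Object streams and cross-reference streams for more compact files, JPEG 2000 support, XFA forms, public-key encryption, custom encryption methods, optional content groups. |
| 1.6 | 7.0 | 2004 | OpenType fonts, 3D content, AES encryption, new color spaces. |
| 1.7 (later ISO 32000-1:2008) | 8.0 | 2006 | XFA 2.4, new string types, extensions to the public-key architecture. |
| 1.7 Extension Level 3 | 9.0 | 2008 | 256-bit AES encryption. |
| 1.7 Extension Level 5 | 9.1 | 2009 | XFA 3.0. |
| 1.7 Extension Level 8 | X | 2011 | Not yet known. |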
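As an illustration of how a program might act on these versions, here is a minimal Python sketch. It assumes the version is read from the header comment at the very start of the file (of the form `%PDF-1.7`) and ignores any other place a version might be recorded; the function names are hypothetical.

```python
import re

def pdf_version(header):
    """Return (major, minor) parsed from the first bytes of a PDF file, or None."""
    match = re.match(rb"%PDF-(\d+)\.(\d+)", header)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def is_newer_than_reader(file_version, reader_version):
    """True when the file declares a later version than the reader was written for."""
    return file_version > reader_version

version = pdf_version(b"%PDF-1.7\n% rest of the file follows")
```

A reader written for PDF 1.4 could use such a check to warn that a 1.7 file may contain features it does not understand, while still attempting to load it.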
A typical PDF file contains many thousands of objects, multiple compression mechanisms, different font formats, and a mixture of vector and raster graphics together with a wide variety of metadata and ancillary content. We take a brief tour of these elements here, for context—they are covered more fully in later chapters.
A PDF file can contain text drawn from multiple fonts of all popular formats (Type 1, TrueType, OpenType, legacy bitmap fonts, etc.). Font files are embedded in the document, so the character shapes are always available, meaning the file should render the same on any computer. A variety of character encodings are supported, including Unicode.
Text can be filled with any color, pattern, or transparency. A piece of text may be used as a shape to clip other content, allowing complicated graphical effects whilst text remains selectable and editable.
Typically, enough information is encoded in a PDF document to allow text extraction, though the process is not always straightforward.
Graphical content in PDF is based on the model first used in Adobe’s PostScript language. It consists of paths built from straight lines and curves. Each path may be filled, “stroked” to draw a line, or both. Lines can have varying thicknesses, join styles and dash patterns.
Paths may be filled in any color, with a repeating pattern defined by other objects, or with a smooth gradient between two colors. All these options apply also to the lines of stroked paths.
Paths can be rendered using a variety of plain or gradient transparencies, with several different blend modes defining how semitransparent objects interact. Objects may be grouped together for the purposes of transparency, so a single transparency can be applied to a whole group of objects at once.
Paths can be used to clip other objects, so that only sections of those objects overlapping with the clipping path are shown. These clipping regions may be nested within one another.
PDF has a mechanism which allows a graphic to be defined once and then used multiple times in different contexts. This can be used, for instance, for a recurring motif, even across more than one page.
PDF documents can include bitmap images between 1 and 16 bits per component, in several color spaces (for example, three-component RGB or four-component CMYK). Images can be compressed using a variety of lossless and lossy compression mechanisms.
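To see why compression matters, the Python sketch below estimates the uncompressed size of a bitmap image from its dimensions, number of color components, and bits per component. It assumes each row of samples is padded up to a whole number of bytes; the function name is our own.

```python
def raw_image_bytes(width, height, components, bits_per_component):
    """Uncompressed size in bytes, with each row rounded up to a whole byte."""
    bits_per_row = width * components * bits_per_component
    bytes_per_row = (bits_per_row + 7) // 8
    return bytes_per_row * height

rgb_size = raw_image_bytes(100, 100, 3, 8)        # three-component RGB, 8 bits each
bilevel_size = raw_image_bytes(10, 1, 1, 1)       # one row of ten 1-bit samples
cmyk_page = raw_image_bytes(2480, 3508, 4, 8)     # roughly an A4 page at 300 dpi in CMYK
```

The last figure comes to about 35 million bytes for a single page image before any compression is applied.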
Images may be placed at any scale or rotation, used to create a fill pattern, and may have a mask which defines how they use transparency to blend with the background they are placed on.
PDF can use color spaces related to particular electronic or print devices (grayscale, RGB, CMYK) and ones related to human color perception. In addition, there are color spaces for the printing industry such as spot colors. Mechanisms exist for simpler PDF programs (like onscreen viewers) to fall back to basic color spaces if they do not support the more advanced ones.
PDF documents have a set of standard metadata, such as title, author, keywords, and so on. These are defined outside the graphical content and have no effect on the document when viewed. The creator (the program which created the content) and producer (the program that wrote the PDF file) are also recorded. Each document also has a set of unique identifiers, allowing it to be tracked through a workflow.
Since PDF 1.4, the metadata can be stored in an XML (eXtensible Markup Language) document embedded in the PDF using Adobe’s Extensible Metadata Platform (XMP). This defines a way to store metadata for objects in the PDF which can be extended by third parties to hold information relevant to their particular workflows or products.
PDF documents have two methods of navigation, when viewed on screen:
The document outline, commonly known as the document’s bookmarks, is a structured list of destinations within the document, shown alongside it. Clicking on one moves the view to that page or position.
Hyperlinks within the text or graphics of a document allow the user to click to move elsewhere within the document, or to open an external URL.
Optional content groups in PDF allow parts of the content of a page to be grouped together and shown—or not shown—based on some other factor (user choice, whether the document is on screen or printed, the zoom factor). Relationships between groups can be defined, so that they depend upon one another. One use for this is to emulate the “layers” found in graphics packages. For example, Adobe Illustrator layers are preserved when a document it produces is read with a PDF viewer.
PDF documents can include various kinds of multimedia elements. A lot of this breaks the portability inherent in PDF, and is often not well supported outside of Adobe products.
Sounds and movies can be embedded.
Slide shows can be defined, to move automatically between pages with transition effects.
A more general system for including arbitrary media types was introduced.
3D Artwork can be embedded.
There are two incompatible forms architectures in PDF: AcroForms, which is an open standard, and the Adobe XML Forms Architecture (XFA), which is documented but requires commercial software from Adobe.
Forms allow users to fill in text fields, and use check boxes and radio buttons. When the data is complete, it may be saved into the document (if allowed) or submitted to a URL for further processing. Embedded JavaScript is often used in conjunction with forms to deal with verification of field values or similar tasks.
Logical structure facilities allow information about the structural content (chapters, sections, figures, tables, and footnotes) to be included alongside the graphical content. The particular elements are customizable by third parties.
A tagged PDF is one which has logical structure based on a set of Adobe-defined elements. Files following these conventions can be reflowed by a reader to display the same text in a different page size or text size, for example in an ebook reader.
PDF documents can be encrypted for security, using RC4 or AES encryption methods. There are two passwords: the owner password and the user password. The owner password unlocks the file for all changes, while the user password allows only the range of operations selected by the owner when the file was originally encrypted (for example, allowing or disallowing printing or text extraction). Frequently the user password is blank, so the file appears to open as normal, but functionality is restricted.
Starting with PDF 1.3, digital signatures can be used to authenticate the identity of a user or the contents of the document.
Images and other data streams in PDF can be compressed using a variety of lossless and lossy methods defined by third parties. By compressing only these streams (rather than the whole file), the structure of the PDF objects is always available without decompressing the whole file, and compressed sections can be processed only when needed. There are several groups of compression methods:
Lossless compression for bi-level (e.g., black and white) images. PDF supports the standard fax encoding methods for bi-level images and, from PDF 1.4, the JBIG2 standard, which provides better compression for the same class of images.
Lossy image filters such as JPEG and, from PDF 1.5, JPEG2000.
Lossless compression mechanisms suitable for image data and general data compression, such as Flate (the zip algorithm), Lempel-Ziv-Welch (LZW), and run-length encoding.
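To make the idea of per-stream compression concrete, here is a minimal Python sketch of a Flate round trip. It assumes that Python's standard zlib module produces the kind of data a Flate-compressed stream holds; the sample bytes are a hypothetical stand-in for a page content stream, not data from a real file.

```python
import zlib

# Hypothetical uncompressed stream data: a repeated text-drawing sequence.
raw = b"BT /F0 36. Tf (Hello, World!) Tj ET\n" * 20

# Compress only this stream; the rest of the file would stay readable.
compressed = zlib.compress(raw)

# A reader reverses the compression only when it needs the stream.
restored = zlib.decompress(compressed)

print(len(raw), len(compressed), restored == raw)
```

Because the compression is lossless, the restored bytes are identical to the original ones.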
PDF is used in a wide variety of industries and professions. We describe some here, explaining why PDF is suitable for each.
PDF has support for the color spaces, page dimension information (such as media, crop, art and bleed boxes), trapping support, and resolution-independence required for commercial printing. Together with other technologies, PDF is the key part of the publishing-for-print workflow. The extensibility of PDF metadata allows various schemes for including extra data along with the document, and for keeping it with the document throughout the publishing process—parts of the workflow which don’t understand a particular piece of metadata will at least preserve it.
This book was created using the DocBook system, which takes a structured document in XML format, typesets it, and produces a PDF complete with hyperlinks and bookmarks, together with a more traditional PDF suitable for printing.
PDF is one of the competing eBook formats. To support display on a wide range of screens, PDF documents may be tagged with reflow information, allowing lines of text to be displayed at differing widths on each device. This is at odds with the other uses of PDF, where fixed text layout is a requirement.
PDF forms are especially useful when existing paper-based systems are being transitioned to electronic ones, or must exist alongside them. A PDF form (filled in online then printed out) looks the same as one filled in manually on paper, and may be processed by existing human and computer systems in the same way.
Automatic submission of forms from within the PDF viewer, the use of JavaScript to add intelligence (making sure figures add up in a tax form, for example), and the use of digital signatures to sign filled-in forms are all compelling reasons to use PDF for electronic forms.
Through PDF/A, PDF is the ideal format for long-term archiving, combining accurate representations of scanned and electronic content, together with Unicode language support, and compression mechanisms for all sorts of data including the important CCITT Fax and JBIG2 methods for monochrome images. Being an ISO standard (and one which is near-ubiquitous) guarantees that these documents can be read long into the future.
PDF can be used for Optical Character Recognition (OCR), allowing searchable text to be created from the original, the exact visual representation being retained alongside the recognized text.
PDF is not, at first sight, suitable for use as an editable vector graphics format. For example, a circle won’t remain editable as a circle, since it will have been converted to a number of curves (there is no circle element in PDF).
However, if appropriate use is made of its extensibility to store auxiliary data, it makes a good solution. Adobe Illustrator, for example, now uses an extended form of PDF as its file format. The file can be viewed in any PDF viewer but Illustrator can make use of the extended data when it is loaded back into the program.
In this book, we use various pieces of software to help us with examples. Luckily, everything you need is freely available. You’ll need a PDF viewer:
Acrobat Reader is Adobe’s own PDF viewer. It supports all versions and features of PDF and comes with a browser plug-in on most platforms. It’s available for Microsoft Windows, Mac OS X, Linux, Solaris, and Android.
Preview is the pre-installed PDF viewer and browser plug-in for PDF documents on Mac OS X. It’s highly capable, and very fast, but doesn’t support everything that Acrobat Reader does. Many people stick with Preview as the default application for PDF files, but install Acrobat Reader as well.
Xpdf is an open source PDF viewer for Unix. It supports a reasonable subset of PDF.
gv is a PostScript and PDF viewer frontend for Ghostscript (see below). It can render the textual and graphical content of almost all documents. However, it lacks most of the interactive features of other PDF viewers.
There are two key command-line tools:
pdftk is a multiplatform command-line tool for processing PDF files in various ways. It can be downloaded in pre-built form for Microsoft Windows, Mac OS X, and Linux, as well as in source code form.
Ghostscript is a set of tools including an interpreter for PostScript and PDF. It can be used to render PDF files, and to process them in various ways from the command line. It is available in binary form for Microsoft Windows, and in source code form for all platforms.
A full discussion of Adobe and open-source PDF software is in Chapter 10.
In this chapter, we’ll build PDF content manually in a text editor. Then we’ll use the free pdftk program to turn it into a valid PDF file and look at the output in a PDF viewer.
This example, together with all the PDF files in this book, can be downloaded from the web page for this book.
We’ll be looking at a lot of new concepts all at once, so don’t worry if it seems overwhelming—we’ll come back to all of this in future chapters.
A PDF file contains at least three distinct languages:
The document content, which is a number of objects with links between them forming a directed graph. These objects describe the structure of the document (pages, metadata, fonts, and resources).
The page content, described using a series of operators for placing text and graphics on a single page.
The file structure, consisting of a header, trailer, and cross-reference table helping programs to locate and read the file’s contents.
The document content consists of objects built out of, amongst others, the following elements:
Names, written as /Name.
Integers, like 50.
Strings, enclosed in parentheses, like (The Quick Brown Fox).
References to other objects, like 2 0 R, a reference to object 2.
Arrays (ordered collections) of objects, like [50 30 /Fred], an array of three items, in order: 50, 30, and /Fred.
Dictionaries (unordered maps from names to objects), like << /Three 3 /Five 5 >>, which maps /Three to 3 and /Five to 5.
Streams, which consist of a dictionary and some binary data. These are used to store streams of PDF graphics operators, and other binary data such as images and fonts.
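One way to get a feel for these elements is to write them out from ordinary data structures. The following Python sketch is illustrative only: the Name and Ref classes and the to_pdf function are hypothetical helpers invented for this example, streams are left out, and string escaping is deliberately omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Name:
    """A PDF name such as /Fred (stored without the leading slash)."""
    value: str


@dataclass(frozen=True)
class Ref:
    """A reference to another object, such as 2 0 R."""
    number: int
    generation: int = 0


def to_pdf(obj) -> str:
    """Write a small subset of PDF objects in their textual syntax."""
    if isinstance(obj, Name):
        return "/" + obj.value
    if isinstance(obj, Ref):
        return f"{obj.number} {obj.generation} R"
    if isinstance(obj, bool):  # checked before int, since bool is an int subclass
        return "true" if obj else "false"
    if isinstance(obj, int):
        return str(obj)
    if isinstance(obj, str):
        # Simplification: backslashes and parentheses are not escaped.
        return "(" + obj + ")"
    if isinstance(obj, list):
        return "[" + " ".join(to_pdf(item) for item in obj) + "]"
    if isinstance(obj, dict):
        pairs = " ".join(f"{to_pdf(k)} {to_pdf(v)}" for k, v in obj.items())
        return "<< " + pairs + " >>"
    raise TypeError(f"unsupported object: {obj!r}")


print(to_pdf([50, 30, Name("Fred")]))
print(to_pdf({Name("Three"): 3, Name("Five"): 5}))
print(to_pdf(Ref(2)))
```

The three print calls reproduce the array, dictionary, and reference examples from the list above.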
For example, here’s a page object, which is a dictionary containing a number of items, each associated with a name:
<< /Type /Page
   /MediaBox [0 0 612 792]
   /Resources 3 0 R
   /Parent 1 0 R
   /Contents [4 0 R]
>>
This dictionary contains five entries:
/Type /Page: The name /Page is associated with the dictionary key /Type.
/MediaBox [0 0 612 792]: The array of four integers [0 0 612 792] is associated with the dictionary key /MediaBox.
/Resources 3 0 R: Object number 3 is associated with the dictionary key /Resources.
/Parent 1 0 R: Object number 1 is associated with the dictionary key /Parent.
/Contents [4 0 R]: The one-element array of indirect references [4 0 R] is associated with the dictionary key /Contents.
The page content is a list of operators, each of which is preceded by zero or more operands. Here’s a series of operators for selecting the /F0 font at 36 points and placing text at the current position:
/F0 36.0 Tf (Hello, World!) Tj
Here, Tf and Tj are the operators, and /F0, 36.0, and (Hello, World!) are the operands. You can see that some syntactic elements (names and strings, for example) are shared across the languages used for both page content and document content.
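The operands-then-operator layout can be sketched in a few lines of Python. This is a simplified illustration, not a full content-stream parser: the TOKEN pattern and parse_operations function are hypothetical, and they assume strings have no nested parentheses or escapes and that operators are plain keywords.

```python
import re

# Names (/F0), simple literal strings, numbers, and keyword operators.
TOKEN = re.compile(
    r"/[^\s()<>\[\]{}/%]+"          # a name
    r"|\([^()\\]*\)"                # a string without nesting or escapes
    r"|[+-]?(?:\d+\.?\d*|\.\d+)"    # an integer or real number
    r"|[A-Za-z][A-Za-z0-9*]*"       # an operator keyword
)


def parse_operations(content: str):
    """Return a list of (operator, operands) pairs in stream order."""
    operations, operands = [], []
    for token in TOKEN.findall(content):
        if token[0] in "/(+-." or token[0].isdigit():
            operands.append(token)            # collect until an operator arrives
        else:
            operations.append((token, operands))
            operands = []
    return operations


print(parse_operations("/F0 36.0 Tf (Hello, World!) Tj"))
```

Each operator consumes the operands that were collected before it, which is why the operand list is reset after every operator.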
The file structure consists of:
A header to distinguish the file as a PDF document.
A cross-reference table listing the byte offsets of each object in the document—this allows the objects to be accessed arbitrarily, rather than having to be read in order.
The trailer, which includes the byte offset of the cross-reference table, followed by an end-of-file marker.
When writing our example file, we’ll use incomplete values for a lot of the file structure, relying on pdftk to fill in the details. For example, it’s impractical for us to write the cross-reference table manually.
The example we’ll be building is just about the simplest meaningful PDF file. However, it needs a surprisingly large number of elements. In addition to the file structure we’ve described above, a minimal PDF document must have a number of basic sections present:
The trailer dictionary, which provides information about how to read the rest of the objects in the file.
The document catalog, which is the root of the object graph.
The page tree, which enumerates the pages in the document.
At least one page. Each page must have:
Its resources, which include, for example, fonts.
Its page content, which contains the instructions for drawing text and graphics on the page.
This arrangement is illustrated in Figure 2-1.
Figure 2-1. Object graph for Hello, World! PDF, with object numbers in brackets from Example 2-1
We’ll type the PDF data into a text file. The line endings chosen by your text editor are unimportant (<LF> [Unix and Mac OS X] and <CR><LF> [Microsoft Windows] are both fine). We’re going to skip some information (the data that is hard to work out manually), relying on pdftk to fill it in afterward. We will:
Use an abbreviated header.
Miss out the length of the page content stream, so we don’t have to manually count the number of bytes.
Omit almost all of the cross-reference table.
Use 0 for the byte offset of the cross-reference table, again to avoid having to count it manually.
First, we’ll look at the sections of the file (in the order in which they appear) and then we’ll put them together and run pdftk to make a valid PDF file.
The file header usually consists of two lines. The first identifies the file as a PDF and gives its version number:
%PDF-1.0                      % PDF version 1.0 header
The second line is hard to type into a text editor since it contains nonprintable characters. We’ll have pdftk do this for us.
On to the main body of the file—the objects. The first is the page list, which is a dictionary linking to the page objects in the document.
1 0 obj                       % Object 1
<< /Type /Pages               % It's a page list
   /Count 1                   % There is one page
   /Kids [2 0 R]              % List of object numbers of pages. Just object 2 here.
>>
endobj                        % End of object 1
Next up is the page. Again, it’s a dictionary. It contains the paper size, an indirect reference back to the page list, and references to the graphical content and resources.
2 0 obj
<< /Type /Page                % It's a page
   /MediaBox [0 0 612 792]    % Paper size is US Letter Portrait (612 points by 792 points)
   /Resources 3 0 R           % Reference to resources at object 3
   /Parent 1 0 R              % Reference back up to parent page list
   /Contents [4 0 R]          % Graphical content is in object 4
>>
endobj
Now, the resources. Here, there is just one entry, the font dictionary, which in our example contains a single font that we’ll use to write some text on the page.
3 0 obj
<< /Font                      % The font dictionary
   << /F0                     % Just one font, called /F0
      << /Type /Font          % These three lines reference the built-in font Times Italic
         /BaseFont /Times-Italic
         /Subtype /Type1
      >>
   >>
>>
endobj
The page contents stream contains a sequence of operators for placing text and graphics on the page. It is linked to by the /Contents entry in the page dictionary.
A stream object consists of a dictionary followed by a raw data stream, containing a series of PDF operands and operators. Normally, this would be compressed to reduce file size, but we’re typing it in manually, so we don’t compress it. We must also specify the length of the stream in bytes; pdftk will add the required /Length entry to the stream dictionary for us.
4 0 obj                       % The page contents stream
<< >>
stream                        % Beginning of stream
1. 0. 0. 1. 50. 700. cm       % Position at (50, 700)
BT                            % Begin text block
  /F0 36. Tf                  % Select /F0 font at 36pt
  (Hello, World!) Tj          % Place the text string
ET                            % End text block
endstream                     % End of stream
endobj
The result of this stream of graphics operators on the page is shown in Figure 2-2.
The last part of the file starts with the document catalog, which is the root object of the object graph. There follows the cross-reference table, which gives the byte offsets of each object in the file. We’ll have pdftk fill this in for us. There are two final lines: the first gives the byte offset of the start of the cross-reference table (we write 0 and pdftk will replace it for us), and the second is the end-of-file marker %%EOF.
5 0 obj
<< /Type /Catalog             % The document catalog
   /Pages 1 0 R               % Reference to the page list
>>
endobj
xref                          % Start of cross-reference table, which we have missed out
0 6
trailer
<< /Size 6                    % Number of lines in cross-reference table (number of objects plus one)
   /Root 5 0 R                % Reference to the document catalog
>>
startxref
0                             % Byte offset of start of xref table, which we have set to 0
%%EOF                         % End of file marker
Now we’re ready to put these pieces together.
The source for this file (Example 2-1) can be found in the online resources for this book, or you can type it in yourself. Save it as hello-broken.pdf.
Example 2-1. The invalid hello-broken.pdf PDF file suitable for manual creation
%PDF-1.0
1 0 obj
<< /Type /Pages
   /Count 1
   /Kids [2 0 R]
>>
endobj
2 0 obj
<< /Type /Page
   /MediaBox [0 0 612 792]
   /Resources 3 0 R
   /Parent 1 0 R
   /Contents [4 0 R]
>>
endobj
3 0 obj
<< /Font
   << /F0
      << /Type /Font
         /BaseFont /Times-Italic
         /Subtype /Type1
      >>
   >>
>>
endobj
4 0 obj
<< >>
stream
1. 0. 0. 1. 50. 700. cm
BT
  /F0 36. Tf
  (Hello, World!) Tj
ET
endstream
endobj
5 0 obj
<< /Type /Catalog
   /Pages 1 0 R
>>
endobj
xref
0 6
trailer
<< /Size 6
   /Root 5 0 R
>>
startxref
0
%%EOF
In this listing, the first line is the file header, the main objects begin at object 1, object 4 holds the graphical content, and object 5 begins the final part of the file: the catalog, cross-reference table, and trailer.
As it stands, hello-broken.pdf is not a valid PDF file, and even Adobe Reader (which is fairly tolerant of malformed files) won’t cope with it.
We can use the free pdftk tool to fix up the hello-broken.pdf file with the missing details, writing the output to hello.pdf:
pdftk hello-broken.pdf output hello.pdf
pdftk reads the file and its objects, and calculates the correct data for the missing or incorrect sections we wrote, and produces the valid file shown in Example 2-2. Note that the spacing and formatting of some of the syntax has been altered—each PDF producer makes slightly different choices about this.
Example 2-2. The completed PDF file hello.pdf, fixed by pdftk
%PDF-1.0
%âãÏÓ
1 0 obj
<<
/Kids [2 0 R]
/Count 1
/Type /Pages
>>
endobj
2 0 obj
<<
/Rotate 0
/Parent 1 0 R
/Resources 3 0 R
/MediaBox [0 0 612 792]
/Contents [4 0 R]
/Type /Page
>>
endobj
3 0 obj
<<
/Font
<<
/F0
<<
/BaseFont /Times-Italic
/Subtype /Type1
/Type /Font
>>
>>
>>
endobj
4 0 obj
<<
/Length 65
>>
stream
1. 0. 0. 1. 50. 700. cm
BT
  /F0 36. Tf
  (Hello, World!) Tj
ET
endstream
endobj
5 0 obj
<<
/Pages 1 0 R
/Type /Catalog
>>
endobj
xref
0 6
0000000000 65535 f
0000000015 00000 n
0000000074 00000 n
0000000192 00000 n
0000000291 00000 n
0000000409 00000 n
trailer
<<
/Root 5 0 R
/Size 6
>>
startxref
459
%%EOF
Some nonprintable characters have been added to the PDF header—this ensures that the file is recognized as binary (rather than text) by, for example, file transfer programs such as FTP.
The length in bytes of the stream has been filled in.
The cross-reference table has been filled in with the byte offsets of each object in the file.
The byte offset of the start of the cross-reference table has been filled in.
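The bookkeeping that pdftk does here can be approximated in a short Python sketch. This is not how pdftk itself is implemented; build_pdf is a hypothetical helper that writes the same five objects and records each byte offset as it goes. It assumes each cross-reference entry is written as exactly 20 bytes, and its offsets will differ from those in Example 2-2 because the spacing is different.

```python
def build_pdf(objects: list[bytes], root: int) -> bytes:
    """Assemble a PDF, computing the byte offsets instead of typing them.

    objects[i] is the body of object number i + 1; root is the object
    number of the document catalog.
    """
    out = bytearray(b"%PDF-1.0\n%\xe2\xe3\xcf\xd3\n")  # header plus binary marker
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where this object starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_offset = len(out)  # byte offset of the cross-reference table
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # the special first entry
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root %d 0 R >>\n" % (len(objects) + 1, root)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


content = b"1. 0. 0. 1. 50. 700. cm\nBT\n/F0 36. Tf\n(Hello, World!) Tj\nET\n"
# The stream length is counted by the program rather than by hand.
stream = b"<< /Length %d >>\nstream\n" % len(content) + content + b"endstream"

pdf = build_pdf(
    [
        b"<< /Type /Pages /Count 1 /Kids [2 0 R] >>",
        b"<< /Type /Page /MediaBox [0 0 612 792] /Resources 3 0 R"
        b" /Parent 1 0 R /Contents [4 0 R] >>",
        b"<< /Font << /F0 << /Type /Font /BaseFont /Times-Italic"
        b" /Subtype /Type1 >> >> >>",
        stream,
        b"<< /Type /Catalog /Pages 1 0 R >>",
    ],
    root=5,
)
print(len(pdf))
```

Writing the returned bytes to a file opened in binary mode should give a file comparable to hello.pdf, though this sketch has not been checked against every viewer.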
The file can now be loaded into a PDF viewer. The result in Acrobat Reader on Microsoft Windows is shown in Figure 2-3.
We’ve seen how to build a simple PDF file from scratch, using pdftk to help us, and we’ve looked at some of the basic syntax that makes up a PDF document.
You can look at existing PDF files using your text editor too. However, some of the data (such as the graphics operators making up the page content) is likely to be compressed and thus unreadable. The pdftk command can be used to decompress these sections for easier reading—see Compression.
In future chapters, we’ll look at the parts of a typical PDF file in some detail and how programs read, write, and edit PDF files. At each stage, there will be the opportunity to build example files by altering and extending the example we built in this chapter.
In this chapter, we describe the layout and content of the PDF file’s four main sections, and the syntax of the objects which make up each one. We also outline the process of reading a PDF file into a high-level data structure, and the converse operation of writing that structure to a PDF file.
A simple valid PDF file has four parts, in order:
The header, which gives the PDF version number.
The body, containing the pages, graphical content, and much of the ancillary information, all encoded as a series of objects.
The cross-reference table, which lists the position of each object within the file, to facilitate random access.
The trailer, including the trailer dictionary, which helps to locate each part of the file and lists various pieces of metadata which can be read without processing the whole file.
For reference, we reproduce the “Hello, World” PDF from Chapter 2 as Example 3-1. The first line of each of the four sections has been annotated.
Example 3-1. A small PDF file
%PDF-1.0                      % Header starts here
%âãÏÓ
1 0 obj                       % Body starts here
<<
/Kids [2 0 R]
/Count 1
/Type /Pages
>>
endobj
2 0 obj
<<
/Rotate 0
/Parent 1 0 R
/Resources 3 0 R
/MediaBox [0 0 612 792]
/Contents [4 0 R]
/Type /Page
>>
endobj
3 0 obj
<<
/Font
<<
/F0
<<
/BaseFont /Times-Italic
/Subtype /Type1
/Type /Font
>>
>>
>>
endobj
4 0 obj
<<
/Length 65
>>
stream
1. 0. 0. 1. 50. 700. cm
BT
  /F0 36. Tf
  (Hello, World!) Tj
ET
endstream
endobj
5 0 obj
<<
/Pages 1 0 R
/Type /Catalog
>>
endobj
xref                          % Cross-reference table starts here
0 6
0000000000 65535 f
0000000015 00000 n
0000000074 00000 n
0000000192 00000 n
0000000291 00000 n
0000000409 00000 n
trailer                       % Trailer starts here
<<
/Root 5 0 R
/Size 6
>>
startxref
459
%%EOF
We now take a closer look at each of these four parts in turn, using Example 3-1 for reference.
The first line of a PDF file gives the version number of the document. In our example, this is:
%PDF-1.0
This defines the file as PDF version 1.0. PDF is backward compatible, so a PDF 1.3 document should be readable by a program which knows about, for example, PDF 1.5. It is also, for the most part, forward compatible, so most PDF programs will attempt to read any file, no matter what the supposed version number is.
Since PDF files almost always contain binary data, they can become corrupted if line endings are changed (for example, if the file is transferred over FTP in text mode). To allow legacy file transfer programs to determine that the file is binary, it is usual to include some bytes with character codes higher than 127 in the header. For example:
%âãÏÓ
The percent sign indicates another header line; the other few bytes are arbitrary character codes in excess of 127. So, the whole header in our example is:
%PDF-1.0
%âãÏÓ
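As a rough illustration of how a program might inspect these two lines, the following Python sketch extracts the version number and checks for high bytes on the second line. The read_header function is a hypothetical helper, and it assumes the header sits within the first 1024 bytes of the data.

```python
import re


def read_header(data: bytes):
    """Return ((major, minor), looks_binary) from the start of a PDF file."""
    lines = data[:1024].splitlines()
    match = re.match(rb"%PDF-(\d)\.(\d)", lines[0]) if lines else None
    if match is None:
        raise ValueError("not a PDF header")
    version = (int(match.group(1)), int(match.group(2)))
    second = lines[1] if len(lines) > 1 else b""
    # A comment line containing bytes above 127 marks the file as binary.
    looks_binary = second.startswith(b"%") and any(byte > 127 for byte in second)
    return version, looks_binary


print(read_header(b"%PDF-1.0\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"))
```

The four bytes after the percent sign in this example are the ones shown as âãÏÓ in the listing.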
The file body consists of a sequence of objects, each preceded by an object number, generation number, and the obj keyword on one line, and followed by the endobj keyword on another. For example:
1 0 obj
<<
/Kids [2 0 R]
/Count 1
/Type /Pages
>>
endobj
Here, the object number is 1, and the generation number is 0 (it almost always is). The content for object 1 is in between the two lines 1 0 obj and endobj. In this case, it’s the dictionary <</Kids [2 0 R] /Count 1 /Type /Pages>>.
The cross-reference table lists the byte offset of each object in the file body. This allows random access to the objects, so that they do not have to be read in order, and an object which is never used is never read. This means, in particular, that simple operations like counting the number of pages in a PDF document can be fast, even on large files.
Every object in a PDF file has an object number and a generation number. Generation numbers are used when a cross-reference table entry is reused; we don’t consider them here (they will always be zero).
For our purposes, we can consider the cross-reference table to consist of a header line indicating the number of entries, then a special entry, then one line for each object in the file body. In our file:
0 6                           % Six entries in table, starting at 0
0000000000 65535 f            % Special entry
0000000015 00000 n            % Object 1 is at byte offset 15
0000000074 00000 n            % Object 2 is at byte offset 74
0000000192 00000 n            % etc...
0000000291 00000 n
0000000409 00000 n            % Object 5 is at byte offset 409
Note that the byte offsets are stored with leading zeros to ensure each entry is the same length. Thus, we can read the cross-reference table with random access too.
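Fixed-length entries mean that entry number n can be found by simple arithmetic. The Python sketch below assumes each entry occupies exactly 20 bytes including its line ending (a 10-digit offset, a 5-digit generation number, a one-letter flag, and separators); read_xref_entry is a hypothetical helper, and the table bytes are retyped from the example.

```python
ENTRY_SIZE = 20  # assumed fixed size of one entry, line ending included


def read_xref_entry(table: bytes, index: int):
    """Read entry `index` directly, without scanning the entries before it.

    `table` holds only the entries, that is, the bytes after the header line.
    """
    start = index * ENTRY_SIZE
    entry = table[start:start + ENTRY_SIZE]
    offset, generation, flag = entry.split()
    return int(offset), int(generation), flag.decode("ascii")


table = (
    b"0000000000 65535 f \n"
    b"0000000015 00000 n \n"
    b"0000000074 00000 n \n"
    b"0000000192 00000 n \n"
    b"0000000291 00000 n \n"
    b"0000000409 00000 n \n"
)
print(read_xref_entry(table, 2))
```

Looking up index 2 jumps straight to byte 40 of the table and reads the entry for object 2.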
The first line of the trailer is just the trailer keyword. This is followed by the trailer dictionary, which contains at least the /Size entry (which gives the number of entries in the cross-reference table) and the /Root entry (which gives the object number of the document catalog, which is the root element of the graph of objects in the body).
There follows a line with just the startxref keyword, a line with a single number (the byte offset of the start of the cross-reference table within the file), and then the line %%EOF, which signals the end of the PDF file.
Here’s the trailer from Example 3-1:
trailer                       % Trailer keyword
<<                            % The trailer dictionary
/Root 5 0 R
/Size 6
>>
startxref                     % startxref keyword
459                           % Byte offset of cross-reference table
%%EOF                         % End-of-file marker
The trailer is read from the end of the file backwards: the end-of-file marker is found, the byte offset of the cross-reference table extracted, and then the trailer dictionary parsed. The trailer keyword marks the upper extent of the trailer.
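The first two of these backward-reading steps can be sketched in Python as follows. The find_xref_offset function is a hypothetical helper; it assumes the end-of-file marker and the startxref keyword both lie within the last 1024 bytes of the file.

```python
def find_xref_offset(data: bytes) -> int:
    """Locate the cross-reference table by working back from the file's end."""
    tail = data[-1024:]
    eof = tail.rfind(b"%%EOF")  # step 1: find the end-of-file marker
    if eof == -1:
        raise ValueError("no end-of-file marker")
    keyword = tail.rfind(b"startxref", 0, eof)  # step 2: the keyword before it
    if keyword == -1:
        raise ValueError("no startxref keyword")
    # The number between the keyword and the marker is the byte offset.
    return int(tail[keyword + len(b"startxref"):eof].split()[0])


sample = b"trailer\n<<\n/Root 5 0 R\n/Size 6\n>>\nstartxref\n459\n%%EOF\n"
print(find_xref_offset(sample))
```

With the offset in hand, a reader can jump to the cross-reference table and then parse the trailer dictionary that follows it.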
A PDF file is a sequence of 8-bit bytes. Using the rules we describe in this chapter, these bytes can be grouped into tokens (such as keywords and numbers), and the file parsed.
Some general rules apply to the main body of the file, and frequently to the various other languages in a PDF file. There are three kinds of characters: regular characters, whitespace characters, and delimiters. The whitespace characters are listed in Table 3-1. The delimiters are ( ) < > [ ] { } / %, and are used to define arrays, dictionaries, and so on. All other characters are regular characters, with no special meaning.
PDF files can use <CR>, <LF>, or a <CR><LF> sequence to end a line. Note, however, that changing the line endings en masse (for example, in a text editor) will likely corrupt the file, since it will alter any line ending sequences that happen to occur in the midst of compressed binary data sections.
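The three character classes are enough to split text into rough tokens. The Python sketch below is illustrative: classify and regular_runs are hypothetical helpers, and the whitespace set (null, tab, line feed, form feed, carriage return, space) is taken from the PDF specification as an assumption, since the table is not reproduced here.

```python
WHITESPACE = b"\x00\t\n\x0c\r "  # assumed set, per the PDF specification
DELIMITERS = b"()<>[]{}/%"


def classify(byte: int) -> str:
    """Name the character class of a single byte."""
    if byte in WHITESPACE:
        return "whitespace"
    if byte in DELIMITERS:
        return "delimiter"
    return "regular"


def regular_runs(data: bytes) -> list[bytes]:
    """Group consecutive regular characters; other classes end a run."""
    runs, current = [], bytearray()
    for byte in data:
        if classify(byte) == "regular":
            current.append(byte)
        elif current:
            runs.append(bytes(current))
            current.clear()
    if current:
        runs.append(bytes(current))
    return runs


print(regular_runs(b"/Kids [2 0 R]"))
```

A real tokenizer would also keep the delimiters, since they decide whether a run is a name, an array element, and so on.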
PDF supports five basic objects:
Integers and real numbers, such as 42 and 3.1415.
Strings, which are enclosed in parentheses, and come in a variety of encodings. For example, (The Quick Brown Fox).
Names, which are used for keys in dictionaries, and innumerable other purposes. They are introduced with a /, for example /Blue.
Boolean values, denoted by the keywords true and false.
The null object, denoted by the keyword null.
and three compound objects:
数组,其中包含其他对象的有序集合,例如[1 0 0 0].
Arrays, which contain an ordered collection of other objects
such as [1 0 0 0].
字典,由无序的对集合组成,将名称映射到对象。例如,<</Contents 4 0 R /Resources 5 0
R>>映射/Contents到间接引用4 0 R和/Resources间接引用
5 0 R。
Dictionaries, which consist of an unordered collection of pairs,
mapping names to objects. For example, <</Contents 4 0 R /Resources 5 0
R>>, which maps /Contents to the indirect reference 4 0 R and /Resources to the indirect reference
5 0 R.
流,其中包含二进制数据,以及描述数据属性(例如其长度和任何压缩参数)的字典。流用于存储图像、字体等。
Streams, which hold binary data, together with a dictionary describing attributes of the data such as its length and any compression parameters. Streams are used to store images, fonts and so on.
以及将对象链接在一起的方法:
and a way of linking objects together:
间接引用,形成从一个对象到另一个对象的链接。
The indirect reference, which forms a link from one object to another.
A PDF file consists of a graph of objects, with indirect references forming the links between them. The object graph for Example 3-1 is shown in Figure 3-1.
An integer is written as one or more of the decimal digits 0..9, optionally preceded by a plus or minus sign:
0 +1 -1 63
A real number is written as one or more decimal digits optionally preceded by a plus or minus sign, and optionally having one decimal point, which may be leading, inside, or following:
0.0 0. .0 -0.004 65.4
Frequently, the specification allows a given object to be either an integer or a real number. Other times it must be an integer. In addition, the range and accuracy of integers and reals are defined by the PDF implementation, not the standard. In certain implementations, if an integer exceeds the range available, it is converted to a real number.
Exponential notation is not allowed. For example, you can’t
write 4.5e-6.
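The number rules above can be sketched as a small classifier. This is only an illustrative sketch in Python, not code from the PDF standard or any particular library; the name `parse_pdf_number` is hypothetical, and Python's own `int` and `float` stand in for whatever range and accuracy a real implementation chooses.

```python
import re

# An optional sign followed by digits; a real has exactly one decimal point,
# which may be leading, inside, or following. No exponent part is accepted.
_INTEGER = re.compile(r"[+-]?[0-9]+")
_REAL = re.compile(r"[+-]?(?:[0-9]+\.[0-9]*|\.[0-9]+)")


def parse_pdf_number(token: str):
    """Return an int or a float for a PDF numeric token, or raise ValueError."""
    if _INTEGER.fullmatch(token):
        return int(token)
    if _REAL.fullmatch(token):
        return float(token)
    raise ValueError(f"not a PDF number: {token!r}")
```

With this sketch, `parse_pdf_number("+1")` gives an integer, `parse_pdf_number("0.")` gives a real, and `parse_pdf_number("4.5e-6")` raises `ValueError`.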
Strings consist of a series of bytes, written between parentheses:
(Hello, World!)
The backslash \ character and
the parenthesis characters ( ) must
be escaped by preceding them with a backslash. For example,
writing:
(Some \\ escaped \(characters)
represents the string “Some \
escaped (characters”. Balanced pairs of parentheses
within the string need not be escaped. For example, (Red (Rouge)) represents the string
“Red (Rouge)”.
A backslash can also be used to introduce other character codes for readability purposes (see Table 3-2).
Table 3-2. Escape sequences in strings
| Character sequence | Meaning |
|---|---|
| \n | Linefeed |
| \r | Carriage return |
| \t | Horizontal tab |
| \b | Backspace |
| \f | Formfeed |
| \ddd | Character code as three octal digits |
After the string is read from the file, and the escaped characters resolved to yield the series of bytes forming the string proper, it may then be interpreted as described in Text Strings.
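To make the escaping rules concrete, here is a minimal sketch in Python of resolving a literal string token to its bytes. The function name `decode_literal_string` is hypothetical. As simplifying assumptions, the sketch takes a complete token, does not check that unescaped parentheses are balanced, and accepts one to three octal digits after a backslash.

```python
_SIMPLE_ESCAPES = {
    ord("n"): 0x0A,  # linefeed
    ord("r"): 0x0D,  # carriage return
    ord("t"): 0x09,  # horizontal tab
    ord("b"): 0x08,  # backspace
    ord("f"): 0x0C,  # formfeed
}
_OCTAL = b"01234567"


def decode_literal_string(raw: bytes) -> bytes:
    """Resolve escapes in a token such as b"(Red (Rouge))" to the string bytes."""
    if not (raw.startswith(b"(") and raw.endswith(b")")):
        raise ValueError("literal strings are written between parentheses")
    body = raw[1:-1]
    out = bytearray()
    i = 0
    while i < len(body):
        byte = body[i]
        if byte != ord("\\"):
            out.append(byte)  # ordinary byte, including balanced parentheses
            i += 1
            continue
        i += 1  # step over the backslash
        if i >= len(body):
            break
        if body[i] in _OCTAL:
            # Collect up to three octal digits and emit the byte they encode.
            j = i
            while j < len(body) and j < i + 3 and body[j] in _OCTAL:
                j += 1
            out.append(int(body[i:j], 8) & 0xFF)
            i = j
        else:
            # \( \) and \\ fall through to the character itself.
            out.append(_SIMPLE_ESCAPES.get(body[i], body[i]))
            i += 1
    return bytes(out)
```

The result is still just a series of bytes; deciding what text those bytes represent is a separate step.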
Strings can also be written as a sequence of hexadecimal digits between < and >, each pair representing a byte:
<4F6Eff00>    Bytes 0x4F, 0x6E, 0xFF, and 0x00
When there is an odd number of digits, the last is assumed to be 0. Hexadecimal strings are typically used to make binary data user-readable. They are functionally the same as strings written in the usual way.
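A hexadecimal string can be decoded in a few lines. The following is a hedged sketch in Python; `decode_hex_string` is a hypothetical name, and the sketch also skips whitespace between digits, which the PDF standard permits but which is not discussed above.

```python
def decode_hex_string(raw: str) -> bytes:
    """Decode a token such as "<4F6Eff00>" into the bytes it represents."""
    if not (raw.startswith("<") and raw.endswith(">")):
        raise ValueError("hexadecimal strings are written between < and >")
    digits = "".join(raw[1:-1].split())  # drop any whitespace between digits
    if len(digits) % 2:
        digits += "0"  # an odd final digit is padded with 0
    return bytes.fromhex(digits)
```

For example, `decode_hex_string("<4F6>")` yields the two bytes 0x4F and 0x60.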
Names are used throughout PDF, as keys for dictionaries and to define various multi-valued objects where using integers to enumerate them would be unintuitive. A name is introduced with the forward slash. For example:
/French
The / character is part of the name; in fact, / on its own is a valid name. The name may not contain whitespace or delimiters, but where a name needs to correspond to some external name which has these characters (for example, spaces), we can use a hash sign followed by two hexadecimal digits:
/Websafe#20Dark#20Green
This represents the name /Websafe Dark
Green since, in ASCII, hexadecimal 20 is the code for space.
Names are case-sensitive (/French and
/french are different).
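The hash-sign rule can be sketched as follows. This is an illustrative Python snippet with a hypothetical function name, `decode_name`; it keeps the leading / because that character is part of the name.

```python
import re


def decode_name(token: str) -> str:
    """Replace each #xx escape in a name token with the character it encodes."""
    if not token.startswith("/"):
        raise ValueError("names are introduced with a forward slash")
    return re.sub(
        r"#([0-9A-Fa-f]{2})",
        lambda match: chr(int(match.group(1), 16)),
        token,
    )
```

Because names are case-sensitive, the decoded results should be compared exactly, with no case folding.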
PDF allows the boolean values true and false. They are frequently used as flags in
dictionary entries.
An array represents an ordered collection of PDF objects, including other arrays. The objects need not all be of the same type. For example, the array:
[0 0 400 500]
contains four numbers in order: 0, 0,
400, 500. The array:
[/Green /Blue [/Red /Yellow]]
contains three items: the name /Green, the name /Blue and the array of two names [/Red /Yellow].
A dictionary represents an unordered collection of
key-value pairs. The dictionary maps the keys to
the values—provide a key, and the value is the result of looking it up
in the dictionary. The keys are names, the values may be any PDF object.
Dictionaries are written between << and >>. For example:
<</One 1 /Two 2 /Three 3>>
maps the name /One to the
integer 1, the name /Two to the integer 2, and the name /Three to the integer 3. Dictionaries can, of course, contain other
dictionaries. Nested dictionaries form the bulk of the non-graphical
structured data in most PDF files.
In order to split the PDF content over separate objects (so data may be read only if required), we connect them together with indirect references. The indirect reference to object 6 is written as:
6 0 R
Here, 6 is the object number,
0 is the generation number (which we
don’t consider here), and R is the
indirect reference keyword.
For example, here’s a typical dictionary using indirect references:
<< /Resources 10 0 R /Contents [4 0 R] >>
In this example, objects 10 and
4 are being referenced in the values
of a dictionary.
Streams are used to store binary data. They are formed of a dictionary followed by a chunk of binary data. The dictionary lists the length of the data, and optionally other parameters, according to the particular use to which the stream is put.
Syntactically, a stream consists of a dictionary, followed by the
stream keyword, a newline (<LF>
or <CR><LF>), zero or more bytes of data, another newline, and
finally the endstream keyword. From our
example file:
4 0 obj                        Object 4
<< /Length 65 >>               Length of the data
stream                         Stream keyword
1. 0. 0. 1. 50. 700. cm        65 bytes of data, here a graphics stream
BT
/F0 36. Tf
(Hello, World!) Tj
ET
endstream                      endstream keyword
endobj                         End of object
Here, the dictionary just contains the /Length entry, which gives the length of the
stream in bytes.
All streams must be indirect objects. Streams are almost always compressed, using a variety of mechanisms, which are listed in Table 3-3.
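Table 3-3. PDF stream compression methods
| Method name | Description |
|---|---|
| /ASCIIHexDecode | Produces one byte of uncompressed data for each pair of hexadecimal digits in the compressed data. > indicates end of data. Whitespace is ignored. This filter and /ASCII85Decode are intended to reduce data to 7 bits; /ASCII85Decode is more complicated, but more compact. |
| /ASCII85Decode | This 7-bit encoding uses the printable characters ! to u, and z. The sequence ~> indicates end of data. |
| /LZWDecode | Implements Lempel-Ziv-Welch compression, as used by the TIFF image format. |
| /FlateDecode | Flate compression, as used by the open source zlib library. Defined in RFC 1950. Both /LZWDecode and /FlateDecode can have predictors in the stream dictionary, which define post-processing of the data to reverse pre-processing done at compression time. |
| /RunLengthDecode | A simple byte-based run-length compressor. |
| /CCITTFaxDecode | Implements the Group 3 and Group 4 encodings used by fax machines. Suitable for monochrome (1bpp) images, not for general data. |
| /JBIG2Decode | A more modern, better compression mechanism for the same kinds of data for which /CCITTFaxDecode is suitable, and also for grayscale and color images and general data. Implements the JBIG2 compression method. |
| /DCTDecode | JPEG lossy compression. A whole JPEG file may be placed here, including all headers. |
| /JPXDecode | JPEG2000 lossy and lossless compression. Restricted to the JPX baseline feature set, with a few exceptions. |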
Here’s an example of a compressed stream:
796 0 obj
<</Length 275 /Filter /FlateDecode>>
stream
HTKO0֟ And 268 more bytes...
endstream
endobj
Multiple filters can be used, by specifying an array instead of a
name for the /Filter entry in the
stream’s dictionary. For example, an image compressed with the JPEG method
then ASCII85 encoded, might have the following filter entry:
/Filter [/ASCII85Decode /DCTDecode]
Filters which require external parameters (for example, defining compression parameters outside the data stream itself) store those in the stream dictionary too.
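When /Filter is an array, a reader undoes the filters in the order the array lists them. The sketch below illustrates this in Python with the standard `zlib` and `base64` modules. It is an assumption-laden example, not a complete decoder: `decode_stream` and `_ascii85_decode` are hypothetical names, only two filters are handled, predictors and other filter parameters are ignored, and /FlateDecode stands in for /DCTDecode because the Python standard library has no JPEG decoder.

```python
import base64
import zlib


def _ascii85_decode(data: bytes) -> bytes:
    # PDF marks the end of ASCII85 data with the two characters "~>".
    end = data.find(b"~>")
    if end != -1:
        data = data[:end]
    return base64.a85decode(data)


_DECODERS = {
    "/FlateDecode": zlib.decompress,
    "/ASCII85Decode": _ascii85_decode,
}


def decode_stream(data: bytes, filters) -> bytes:
    """Apply each named filter's decoder, in the order /Filter lists them."""
    if isinstance(filters, str):
        filters = [filters]  # a single name rather than an array
    for name in filters:
        data = _DECODERS[name](data)
    return data


# Build a stream the way a writer would: compress first, then ASCII85 encode.
content = b"BT /F0 36. Tf (Hello, World!) Tj ET"
encoded = base64.a85encode(zlib.compress(content)) + b"~>"
decoded = decode_stream(encoded, ["/ASCII85Decode", "/FlateDecode"])
```

Note that the array order is the decoding order, so it is the reverse of the order in which the writer applied the encodings.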
Incremental update allows a file to be updated by appending modifications to the end of the file, so the whole file doesn’t need to be written again (which, for a large file, could take a long time). The update consists of the new or changed objects, and an update to the cross-reference table. This means saving the changes takes less time, but the file may become bloated (because objects which are no longer needed cannot be deleted).
This updating process may happen several times. A side effect is that the changes to a file updated in this fashion can be undone, one or more levels at a time, to retrieve earlier versions of the document.
When altering a digitally signed document, all updates must be made incrementally—otherwise, the digital signature would be invalidated. The recipient can undo the incremental updates to retrieve the original, certified document.
When a file is updated incrementally, a new trailer is added,
containing all the entries from the previous trailer, together with a
/Prev entry giving the byte offset of
the previous cross-reference table. Thus, a file which has been
incrementally updated will have multiple trailer dictionaries and
end-of-file markers. In this way, a PDF application can read the
cross-reference sections in reverse order to build up a list of the latest
versions of each object in the file. Objects which have been replaced keep
the same object number.
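The reverse-order reading of cross-reference sections can be sketched like this. It is a simplified Python model, not a parser: each cross-reference section is assumed to be already parsed into a dictionary with an `entries` map (object number to byte offset) and an optional `prev` offset, and the name `latest_object_offsets` is hypothetical.

```python
def latest_object_offsets(sections: dict, startxref: int) -> dict:
    """Follow the /Prev chain from the newest section, keeping the first
    (that is, most recent) offset seen for each object number."""
    offsets = {}
    visited = set()
    position = startxref
    while position is not None and position not in visited:
        visited.add(position)  # guard against a malformed, circular chain
        section = sections[position]
        for number, offset in section["entries"].items():
            offsets.setdefault(number, offset)  # newer sections win
        position = section.get("prev")
    return offsets
```

Starting instead from an older section's offset gives the object table for an earlier version of the document, which is how an update can be undone.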
Starting with PDF 1.5, a new mechanism was introduced to further compress PDF files by allowing many objects to be put into a single object stream, the whole stream being compressed. In tandem, a new mechanism for referencing the objects in these streams was introduced—cross-reference streams.
A file will generally use several sets of object streams, grouping together objects which are needed at certain times, for example all the objects on page one, all the objects on page two, and so on. This retains the random access property of the document, which would be lost if all the objects in a file were to be put into a single object stream. Object streams can’t contain other streams.
Files compressed with these mechanisms are rather hard to read
manually, so we can use the decompress
operation in pdftk as usual, to rewrite them
decompressed for inspection. This has the side effect of writing them
without object and cross-reference streams. See Chapter 9
for details.
When viewing a large PDF file in a network environment, especially when the data rate is low or the network latency high, the user does not want to wait for the whole file to download to view it. This is especially important when the document is being viewed inside a web browser.
We should like the first page to appear quickly, and for changing to another page (by clicking on a hyperlink or a bookmark) to be as fast as possible. In the case of individual pages being large (rather than just the whole document), we should like page content to appear incrementally, most-important content first. Network transport mechanisms such as HTTP (The HyperText Transfer Protocol, used for fetching web pages in a web browser) often allow an arbitrary chunk of data to be fetched. However, because of latency, we wish to fetch a single chunk with all the data for a page, rather than hundreds of little chunks, one for each object.
PDF 1.2 introduced such a mechanism, linearized PDF. This adds rules for how to order objects in a file and hint tables to indicate how such objects have been ordered. The system is backward compatible, so that a linearized PDF file is also a normal one, and may be read as such by a reader which does not understand linearized PDF.
A linearized PDF file can be recognized by the presence of a linearization dictionary at the top of the file, directly after the header. For example:
%PDF-1.4
%âãÏÓ
4 0 obj
<< /E 200967 /H [ 667 140 ] /L 201431 /Linearized 1 /N 1 /O 7 /T 201230 >>
endobj
The pdfopt command line program shipped with GhostScript can linearize a file. For example:
pdfopt input.pdf output.pdf
This linearizes input.pdf and
writes the result to output.pdf.
To read a PDF file, converting it from a flat series of bytes into a graph of objects in memory, the following steps might typically occur:
1. Read the PDF header from the beginning of the file, checking that this is, indeed, a PDF document and retrieving its version number.
2. Find the end-of-file marker by searching backward from the end of the file. The trailer dictionary can now be read, and the byte offset of the start of the cross-reference table retrieved.
3. Read the cross-reference table. We now know where each object in the file is.
4. At this stage, all the objects can be read and parsed, or we can leave this process until each object is actually needed, reading it on demand.
5. We can now use the data, extracting the pages, parsing graphical content, extracting metadata, and so on.
This is not an exhaustive description, since there are many possible complications (encryption, linearization, and object and cross-reference streams).
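The first two reading steps can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions: the whole file is already in memory, the header sits at byte 0, and none of the complications just listed apply. The name `read_header_and_startxref` is hypothetical.

```python
import re


def read_header_and_startxref(data: bytes):
    """Return ((major, minor), byte offset of the cross-reference table)."""
    header = re.match(rb"%PDF-(\d)\.(\d)", data)
    if header is None:
        raise ValueError("not a PDF file")
    version = (int(header.group(1)), int(header.group(2)))

    # Search backward from the end of the file for the end-of-file marker,
    # then for the startxref keyword that precedes it.
    eof = data.rfind(b"%%EOF")
    if eof == -1:
        raise ValueError("no end-of-file marker")
    keyword = data.rfind(b"startxref", 0, eof)
    if keyword == -1:
        raise ValueError("no startxref keyword")
    offset = int(data[keyword + len(b"startxref"):eof].split()[0])
    return version, offset


sample = (
    b"%PDF-1.4\n1 0 obj\n<< >>\nendobj\n"
    b"xref\n0 2\ntrailer\n<< /Size 2 >>\nstartxref\n30\n%%EOF\n"
)
version, xref_offset = read_header_and_startxref(sample)
```

From the returned offset, a reader would go on to parse the cross-reference table and trailer dictionary, which this sketch leaves out.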
The following recursive data structure, given in pseudocode, can hold a PDF object.
pdfobject ::= Null
| Boolean of bool
| Integer of int
| Real of real
| String of string
| Name of string
| Array of pdfobject array
| Dictionary of (string, pdfobject) array    Array of (string, pdfobject) pairs
| Stream of (pdfobject, bytes)    Stream dictionary and stream data
| Indirect of int
For example, the object << /Kids [2 0
R] /Count 1 /Type /Pages >> might be represented
as:
Dictionary ((Name (/Kids), Array (Indirect 2)), (Name (/Count), Integer (1)), (Name (/Type), Name (/Pages)))
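For readers more at home in Python, the same structure might be modelled as below. This is one possible mapping rather than the book's own code: the class names `Name`, `Indirect`, and `Stream` are illustrative, built-in Python types stand in for the remaining cases, and dictionary keys are plain strings as in the pseudocode.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Name:
    value: str  # includes the leading slash, e.g. "/Pages"


@dataclass(frozen=True)
class Indirect:
    number: int  # object number of the referenced object


@dataclass
class Stream:
    dictionary: dict  # the stream dictionary
    data: bytes       # the stream data


# Null, Boolean, Integer, Real, String, Array and Dictionary map onto
# None, bool, int, float, bytes, list and dict respectively.
PdfObject = Union[None, bool, int, float, bytes, Name, list, dict, Stream, Indirect]

# The object << /Kids [2 0 R] /Count 1 /Type /Pages >> in this model:
pages = {
    "/Kids": [Indirect(2)],
    "/Count": 1,
    "/Type": Name("/Pages"),
}
```

Keeping `Name` distinct from `bytes` preserves the difference between the name /Pages and the string (Pages).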
Figure 3-1, shown earlier in the chapter, shows the object graph for the file in Example 3-1.
Writing a PDF document to a series of bytes in a file is much simpler than reading it: we don’t need to support all of the PDF format, just the subset we intend to use. Writing a PDF file is very fast, since it amounts to little more than flattening the object graph to a series of bytes.
1. Output the header.
2. Remove any objects which are not referenced by any other object in the PDF. This avoids writing objects which are no longer needed.
3. Renumber the objects so they run from 1 to n, where n is the number of objects in the file.
4. Output the objects one by one, starting with object number one, recording the byte offset of each for the cross-reference table.
5. Write the cross-reference table.
6. Write the trailer, trailer dictionary, and end-of-file marker.
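The output steps can be sketched as follows. This Python example is a simplification under stated assumptions: objects arrive already serialized as bytes and already numbered 1 to n, so the removal and renumbering steps are skipped, and the function name `write_pdf` is hypothetical.

```python
def write_pdf(objects: dict, root: int) -> bytes:
    """Flatten {object number: serialized object body} into PDF file bytes."""
    count = len(objects)
    if sorted(objects) != list(range(1, count + 1)):
        raise ValueError("objects must be numbered from 1 to n")

    out = bytearray(b"%PDF-1.4\n")  # the header
    offsets = []
    for number in range(1, count + 1):
        offsets.append(len(out))  # byte offset for the cross-reference table
        out += b"%d 0 obj\n" % number + objects[number] + b"\nendobj\n"

    xref_start = len(out)
    out += b"xref\n0 %d\n" % (count + 1)
    out += b"0000000000 65535 f \n"  # entry for the special object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is 20 bytes long

    out += b"trailer\n<< /Size %d /Root %d 0 R >>\n" % (count + 1, root)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)


pdf = write_pdf(
    {
        1: b"<< /Type /Catalog /Pages 2 0 R >>",
        2: b"<< /Type /Pages /Kids [] /Count 0 >>",
    },
    root=1,
)
```

The trailer's /Size is one more than the number of objects because the cross-reference table also has an entry for object 0.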
In this chapter, we leave behind the bits and bytes of the PDF file, and consider the logical structure. We consider the trailer dictionary, document catalog, and page tree. We enumerate the required entries in each object. We then look at two common structures in PDF files: text strings and dates.
Figure 4-1 shows the logical structure of a typical document.
This dictionary, residing in the file’s trailer rather than the main body of the file, is one of the first things to be processed when a program wants to read a PDF document. It contains entries allowing the cross-reference table—and thus the file’s objects—to be read. Its important entries are summarized in Table 4-1.
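Table 4-1. Entries in a trailer dictionary (*denotes required entry)
| Key | Value type | Value |
|---|---|---|
| /Size* | Integer | Total number of entries in the file’s cross-reference table (usually equal to the number of objects in the file plus one). |
| /Root* | Indirect reference to dictionary | The document catalog. |
| /Info | Indirect reference to dictionary | The document’s document information dictionary. |
| /ID | Array of two strings | Uniquely identifies the file within a workflow. The first string is decided when the file is first created; the second is modified by workflow systems when they modify the file. |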
Here’s an example trailer dictionary:
<<
/Size 421
/Root 377 0 R
/Info 375 0 R
/ID [<75ff22189ceac848dfa2afec93deee03> <057928614d9711db835e000d937095a2>]
>>
Once the trailer dictionary has been processed, we can go on to read the document information dictionary and the document catalog.
The document information dictionary contains the creation and modification dates of the file, together with some simple metadata (not to be confused with the more comprehensive XMP metadata discussed in XML Metadata).
Document information dictionary entries are described in Table 4-2. A typical document information dictionary is given in Example 4-1.
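Table 4-2. Entries in a document information dictionary. The types “text string” and “date string” are explained later in this chapter.
| Key | Value type | Value |
|---|---|---|
| /Title | Text string | The document’s title. Note that this has nothing to do with any title displayed on the first page. |
| /Subject | Text string | The subject of the document. Again, this is just metadata, with no particular rules about content. |
| /Keywords | Text string | Keywords associated with this document. No advice is given as to how to structure these. |
| /Author | Text string | The name of the author of the document. |
| /CreationDate | Date string | The date the document was created. |
| /ModDate | Date string | The date the document was last modified. |
| /Creator | Text string | The name of the program which originally created this document, if it started as another format (for example, “Microsoft Word”). |
| /Producer | Text string | The name of the program which converted this file to PDF, if it started as another format (for example, the format of a word processor). |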
Example 4-1. Typical document information dictionary
<<
/ModDate (D:20060926213913+02'00')
/CreationDate (D:20060926213913+02'00')
/Title (catalogueproduit-UK.qxd)
/Creator (QuarkXPress: pictwpstops filter 1.0)
/Producer (Acrobat Distiller 6.0 for Macintosh)
/Author (James Smith)
>>
The date string format (for /CreationDate and /ModDate) is discussed in the section Dates. The text string format (which
describes how different encodings can be used within the string type) is
described in Text Strings.
The document catalog is the root object of the main object graph, from which all other objects may be reached through indirect references. In Table 4-3, we list the document catalog dictionary entries which are required, and some of the many optional ones, so as to introduce brief PDF topics we don’t cover elsewhere in these pages.
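Table 4-3. The document catalog (*denotes required entry)
| Key | Value type | Value |
|---|---|---|
| /Type* | Name | Must be /Catalog. |
| /Pages* | Indirect reference to dictionary | The root node of the page tree. Page trees are discussed in Pages and Page Trees. |
| /PageLabels | Number tree | A number tree giving the page labels for this document. This mechanism allows pages in a document to have more complicated numbering than just 1, 2, 3.... For example, the preface of a book may be numbered i, ii, iii..., whilst the main content starts again at 1, 2, 3.... These page labels are displayed in PDF viewers; they have nothing to do with printed output. |
| /Names | Dictionary | The name dictionary. This contains various name trees, which map names to entities, to prevent having to use object numbers to reference them directly. |
| /Dests | Dictionary | A dictionary mapping names to destinations. A destination is a description of a place within a PDF document to which a hyperlink sends the user. |
| /ViewerPreferences | Dictionary | A viewer preferences dictionary, which allows flags to specify the behavior of a PDF viewer when the document is viewed on screen, such as the page it is opened on, the initial viewing scale and so on. |
| /PageLayout | Name | Specifies the page layout to be used by PDF viewers. Values are /SinglePage, /OneColumn, /TwoColumnLeft, /TwoColumnRight, /TwoPageLeft, /TwoPageRight. (Default: /SinglePage.) Details are in Table 28 of ISO 32000-1:2008. |
| /PageMode | Name | Specifies the page mode to be used by PDF viewers. Values are /UseNone, /UseOutlines, /UseThumbs, /FullScreen, /UseOC, /UseAttachments. (Default: /UseNone.) Details are in Table 28 of ISO 32000-1:2008. |
| /Outlines | Indirect reference to dictionary | The outline dictionary is the root of the document outline, commonly known as the bookmarks. |
| /Metadata | Indirect reference to stream | The document’s XMP metadata; see XML Metadata. |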
A page dictionary in a PDF document brings together instructions for drawing the graphical and textual content (which we consider in Chapter 5 and Chapter 6) with the resources (fonts, images, and other external data) which those instructions make use of. It also includes the page size, together with a number of other boxes defining cropping and so forth.
The entries in a page dictionary are summarized in Table 4-4.
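Table 4-4. Entries in a page dictionary (*denotes required entry)
| Key | Value type | Value |
|---|---|---|
| /Type* | Name | Must be /Page. |
| /Parent* | Indirect reference to dictionary | The parent node of this node in the page tree. |
| /Resources | Dictionary | The page’s resources (fonts, images and so on). If this entry is omitted entirely, the resources are inherited from the parent node in the page tree. If there are really no resources, include this entry but use an empty dictionary. |
| /Contents | Indirect reference to stream, or array of such references | The graphical content of the page, in one or more sections. If this entry is missing, the page is empty. |
| /Rotate | Integer | The viewing rotation of the page in degrees, clockwise from north. The value must be a multiple of 90. Default value: 0. This applies to both viewing and printing. If this entry is missing, its value is inherited from its parent node in the page tree. |
| /MediaBox* | Rectangle | The page’s media box (the size of its media, i.e., paper). For most purposes, the page size. If this entry is missing, it is inherited from the parent node in the page tree. |
| /CropBox | Rectangle | The page’s crop box. This defines the region of the page visible by default when a page is displayed or printed. If absent, its value is defined to be the same as the media box. |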
The rectangle data structure for the media box and the other boxes is an array of four numbers. These define the diagonally opposite corners of the rectangle—the first two elements of the array being the x and y coordinates of one corner, the latter two elements being those of the other. Normally, the lower-left and upper-right corners are given. So, for example:
/MediaBox [0 0 500 800] /CropBox [100 100 400 700]
defines a 500 by 800 point page with a crop box removing 100 points on each side of the page.
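Because either pair of diagonally opposite corners may be given, code that uses a rectangle usually normalizes it first. A small Python sketch, with the hypothetical helper names `normalize_rectangle` and `rectangle_size`:

```python
def normalize_rectangle(rect):
    """Return (left, bottom, right, top) whichever two corners were given."""
    x0, y0, x1, y1 = rect
    return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


def rectangle_size(rect):
    """Return (width, height) of a rectangle, in points."""
    left, bottom, right, top = normalize_rectangle(rect)
    return (right - left, top - bottom)
```

For the example above, the media box measures 500 by 800 points and the crop box 300 by 600 points.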
The pages are linked together using a page tree, rather than a simple array. This tree structure makes it faster to find a given page in a document with hundreds or thousands of pages. Good PDF applications build a balanced tree (one with the minimum height for the number of nodes). This ensures that a particular page can be located quickly. The nodes with no children are the pages themselves. An example page tree structure for seven pages is shown in Figure 4-2.
This would be written in PDF objects as shown in Example 4-2. The entries in an intermediate or root page tree node (i.e., not a page itself) are summarized in Table 4-5.
Figure 4-2. A page tree for seven pages. The exact shape of the tree is left to the individual PDF application. The PDF code for this tree is shown in Example 4-2.
Example 4-2. PDF objects used to build the page tree illustrated in Figure 4-2
1 0 obj    Root node
<< /Type /Pages /Kids [2 0 R 3 0 R 4 0 R] /Count 7 >>
endobj
2 0 obj    Intermediate node
<< /Type /Pages /Kids [5 0 R 6 0 R 7 0 R] /Parent 1 0 R /Count 3 >>
endobj
3 0 obj    Intermediate node
<< /Type /Pages /Kids [8 0 R 9 0 R 10 0 R] /Parent 1 0 R /Count 3 >>
endobj
4 0 obj    Page 7
<< /Type /Page /Parent 1 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
5 0 obj    Page 1
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
6 0 obj    Page 2
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
7 0 obj    Page 3
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
8 0 obj    Page 4
<< /Type /Page /Parent 3 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
9 0 obj    Page 5
<< /Type /Page /Parent 3 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
10 0 obj    Page 6
<< /Type /Page /Parent 3 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
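Table 4-5. Entries in an intermediate or root page tree node (*denotes a required entry)
| Key | Value type | Value |
|---|---|---|
| /Type* | Name | Must be /Pages. |
| /Kids* | Array of indirect references | The immediate child page tree nodes of this node. |
| /Count* | Integer | The number of page nodes (not other page tree nodes) which are eventual children of this node. |
| /Parent | Indirect reference to page tree node | Reference to the parent of this node (the node of which this one is a child). Must be present if this is not the root node of the page tree. |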
In this tree, any page can be found at most two indirect references away from the root node.
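The /Count entries are what make the lookup fast: a reader can skip a whole subtree without visiting its pages. The sketch below shows the idea in Python on a simplified model in which each object is a dictionary, indirect references are plain object numbers, and names are strings. The function name `find_page` is hypothetical, and the page index is zero-based and assumed to be non-negative.

```python
def find_page(objects: dict, node_number: int, index: int) -> int:
    """Return the object number of the page at the given zero-based index."""
    node = objects[node_number]
    if node["/Type"] == "/Page":
        if index != 0:
            raise IndexError("page index out of range")
        return node_number
    for kid in node["/Kids"]:
        child = objects[kid]
        # A page counts as one; a page tree node counts its /Count pages.
        size = 1 if child["/Type"] == "/Page" else child["/Count"]
        if index < size:
            return find_page(objects, kid, index)
        index -= size  # skip this whole subtree
    raise IndexError("page index out of range")
```

Applied to the tree of Example 4-2, the first page is object 5 and the seventh page is object 4.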
Strings outside of the actual textual content of a page (e.g., bookmark names, document information, etc.) are known as text strings. They are encoded using either PDFDocEncoding or (in more recent documents) Unicode. PDFDocEncoding is based on the ISO Latin-1 encoding. It is documented fully in Annex D of ISO 32000-1:2008.
Text strings which are encoded as Unicode are distinguished by looking at the first two bytes: these will be 254 followed by 255. This is the Unicode byte-order marker U+FEFF, which indicates the UTF-16BE encoding. This means a PDFDocEncoding string can’t begin with þ (254) followed by ÿ (255), but this is unlikely to occur in any reasonable circumstance.
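The byte-order-marker test translates directly into code. In this Python sketch the function name `decode_text_string` is hypothetical, and Latin-1 is used as a stand-in for PDFDocEncoding. That substitution is an assumption made for brevity: the two encodings differ for some byte values, so a real implementation would use the table in Annex D.

```python
def decode_text_string(raw: bytes) -> str:
    """Decode the bytes of a text string into Python text."""
    if raw[:2] == b"\xfe\xff":  # bytes 254, 255: the byte-order marker
        return raw[2:].decode("utf-16-be")
    # Assumption: Latin-1 approximates PDFDocEncoding for this illustration.
    return raw.decode("latin-1")
```

The input here is the string's bytes after any escape sequences have already been resolved.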
The creation and modification dates /CreationDate and /ModDate in the document information dictionary
are examples of the PDF date format, which encodes a date in a string,
including information about the time zone.
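A date string has the format:
(D:YYYYMMDDHHmmSSOHH'mm')
where the parentheses indicate a string as usual. The other parts of the date are summarized in Table 4-6.
Table 4-6. PDF date format constituents
| Portion | Meaning |
|---|---|
| YYYY | The year, in four digits, e.g. 2008. |
| MM | The month, in two digits from 01 to 12. |
| DD | The day, in two digits from 01 to 31. |
| HH | The hour, in two digits from 00 to 23. |
| mm | The minute, in two digits from 00 to 59. |
| SS | The second, in two digits from 00 to 59. |
| O | The relationship of local time to Universal Time (UT), either +, - or Z. + signifies that local time is later than UT, - that it is earlier, and Z that it is equal to UT. |
| HH' | The absolute value of the offset from Universal Time in hours, in two digits from 00 to 23. |
| mm' | The absolute value of the offset from Universal Time in minutes, in two digits from 00 to 59. |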
All parts of the date after the year are optional. For example, (D:1999) is perfectly valid. Plainly, though, if you omit one part, you must omit everything which follows, otherwise the result would be ambiguous. The default value for DD and MM is 01; for all other parts, the default is zero.
For example:
(D:20060926213913+02'00')
represents September 26th 2006 at 9:39:13 p.m., in a time zone two hours ahead of Universal Time.
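The date format, including its defaults, can be parsed with a single regular expression. The following Python sketch is illustrative only: `parse_pdf_date` is a hypothetical name, the input is the string's contents without the surrounding parentheses, and a date with no O part is returned without time zone information.

```python
import re
from datetime import datetime, timedelta, timezone

# Every part after the year is optional, but a part can only be present
# if all the parts before it are present too.
_DATE = re.compile(
    r"D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([+\-Z])(?:(\d{2})'(?:(\d{2})'?)?)?)?"
)


def parse_pdf_date(text: str) -> datetime:
    """Parse the contents of a PDF date string into a datetime."""
    match = _DATE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a PDF date: {text!r}")
    year, month, day, hour, minute, second, sign, off_h, off_m = match.groups()
    tz = None  # no O part: relationship to Universal Time is unknown
    if sign is not None:
        delta = timedelta(hours=int(off_h or 0), minutes=int(off_m or 0))
        tz = timezone(-delta if sign == "-" else delta)
    return datetime(
        int(year), int(month or 1), int(day or 1),   # DD and MM default to 01
        int(hour or 0), int(minute or 0), int(second or 0),  # others to zero
        tzinfo=tz,
    )
```

So `parse_pdf_date("D:1999")` gives midnight on January 1st 1999, with no time zone attached.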
This is a manually-created text, to be processed into a valid PDF file by pdftk using the method introduced in Chapter 2. It is a three page document, with document information dictionary and page tree. Figure 4-3 shows this document displayed in Acrobat Reader. Figure 4-4 is the corresponding object graph.
Example 4-3. A three page document with document information dictionary
%PDF-1.0    Header
1 0 obj    Top-level of page tree: has two children, page one and an intermediate page tree node
<< /Kids [2 0 R 3 0 R] /Type /Pages /Count 3 >>
endobj
4 0 obj    Contents stream for page one
<< >>
stream
1. 0.000000 0.000000 1. 50. 770. cm BT /F0 36. Tf (Page One) Tj ET
endstream
endobj
2 0 obj    Page one
<<
/Rotate 0
/Parent 1 0 R
/Resources << /Font << /F0 << /BaseFont /Times-Italic /Subtype /Type1 /Type /Font >> >> >>
/MediaBox [0.000000 0.000000 595.275590551 841.88976378]
/Type /Page
/Contents [4 0 R]
>>
endobj
5 0 obj    Document catalog
<< /PageLayout /TwoColumnLeft /Pages 1 0 R /Type /Catalog >>
endobj
6 0 obj    Page three
<<
/Rotate 0
/Parent 3 0 R
/Resources << /Font << /F0 << /BaseFont /Times-Italic /Subtype /Type1 /Type /Font >> >> >>
/MediaBox [0.000000 0.000000 595.275590551 841.88976378]
/Type /Page
/Contents [7 0 R]
>>
endobj
3 0 obj    Intermediate page tree node, linking to pages two and three
<< /Parent 1 0 R /Kids [8 0 R 6 0 R] /Count 2 /Type /Pages >>
endobj
8 0 obj    Page two
<<
/Rotate 270
/Parent 3 0 R
/Resources << /Font << /F0 << /BaseFont /Times-Italic /Subtype /Type1 /Type /Font >> >> >>
/MediaBox [0.000000 0.000000 595.275590551 841.88976378]
/Type /Page
/Contents [9 0 R]
>>
endobj
9 0 obj    Content stream for page two
<< >>
stream
q 1. 0.000000 0.000000 1. 50. 770. cm BT /F0 36. Tf (Page Two) Tj ET Q
1. 0.000000 0.000000 1. 50. 750 cm BT /F0 16 Tf ((Rotated by 270 degrees)) Tj ET
endstream
endobj
7 0 obj    Content stream for page three
<< >>
stream
1. 0.000000 0.000000 1. 50. 770. cm BT /F0 36. Tf (Page Three) Tj ET
endstream
endobj
10 0 obj    Document information dictionary
<<
/Title (PDF Explained Example)
/Author (John Whitington)
/Producer (Manually Created)
/ModDate (D:20110313002346Z)
/CreationDate (D:2011)
>>
endobj
xref
0 11
trailer    Trailer dictionary
<<
/Info 10 0 R
/Root 5 0 R
/Size 11
/ID [<75ff22189ceac848dfa2afec93deee03> <057928614d9711db835e000d937095a2>]
>>
startxref
0
%%EOF
Figure 4-3. Example 4-3 converted to a valid PDF with pdftk and displayed in Acrobat Reader
In this chapter, we’ll run through the main ways to build graphics in the content stream of a PDF page. All of the examples are based on the same PDF we created manually in Chapter 2 and processed into valid PDF documents with pdftk in the same fashion. All the examples are included in the online resources.
A PDF page is made up of one or more content
streams, defined by the /Contents entry in the page object, together
with a shared set of resources, defined by the /Resources entry. In all our examples, there
will only be a single content stream. Multiple content streams are
equivalent to a single stream containing their concatenated
content.
Here’s an example page, with no resources and a single content stream:
3 0 obj
<< /Type /Page /Parent 1 0 R /Resources << >> /MediaBox [ 0 0 792 612 ] /Rotate 0 /Contents [ 2 0 R ] >>
endobj
Here’s the associated content stream, consisting of the stream dictionary and the stream data:
2 0 obj
% Stream dictionary
<< /Length 18 >>
% Stream data
stream
200 150 m 600 450 l S
endstream
endobj
We’ll discover what the m, l and S operators do in a moment. The numbers are measurements in points; a point (or pt) is 1/72 inch. The result of loading this document into a PDF viewer (after processing with pdftk as per Chapter 2) is shown in Figure 5-1.
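A program that writes PDF directly, rather than relying on pdftk to tidy a hand-typed file, has to compute /Length itself. The following is a minimal Python sketch of that step; the function name `make_stream_object` is hypothetical and the sketch handles only an uncompressed stream with no other dictionary entries.

```python
def make_stream_object(obj_num: int, data: bytes) -> bytes:
    """Serialize an uncompressed PDF stream object, computing /Length from the data."""
    header = f"{obj_num} 0 obj\n<< /Length {len(data)} >>\nstream\n".encode("ascii")
    return header + data + b"\nendstream\nendobj\n"


# The content stream from the example above: one line from (200, 150) to (600, 450).
content = b"200 150 m 600 450 l S"
stream_object = make_stream_object(2, content)
print(stream_object.decode("ascii"))
```

The byte count of this particular stream data is 21, so a generator would write `/Length 21` here.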
The full manually created file (before processing with pdftk) is shown in Example 5-1. We’re going to be using variations on this file for the rest of this chapter. For the most part we’ll just change the content stream for each example, but later on we’ll need to add one or more extra resources to the PDF. All of these files are found in the online resources for this book.
Example 5-1. Skeleton PDF listing for examples in this chapter
%PDF-1.0
% PDF header
% Page tree
1 0 obj
<< /Kids [2 0 R] /Type /Pages /Count 1 >>
endobj
% Page object
2 0 obj
<< /Rotate 0 /Parent 1 0 R /MediaBox [0 0 792 612] /Resources 3 0 R /Type /Page /Contents [4 0 R] >>
endobj
% Resources
3 0 obj
<< >>
endobj
% Page content stream
4 0 obj
<< /Length 19 >>
stream
200 150 m 600 450 l S
endstream
endobj
% Document catalog
5 0 obj
<< /Pages 1 0 R /Type /Catalog >>
endobj
% Skeleton cross-reference table
xref
0 6
trailer
% Trailer dictionary
<< /Root 5 0 R /Size 6 >>
startxref
0
% End-of-file marker
%%EOF
Content streams are almost always compressed, so to inspect the content stream of an existing document, we can use the pdftk uncompress option. For example, the command:
pdftk input.pdf output output.pdf uncompress
writes input.pdf to output.pdf with the streams uncompressed.
A content stream consists of a series of operators, each preceded by zero or more operands. Table 5-1 lists the 78 graphics operators in 6 groups. In this chapter, we’ll be looking at selected operators from the first four groups.
Table 5-1. PDF graphics operators
| Group | Used for | Operators |
|---|---|---|
| Graphics state operators | Changing the graphics state (current color, stroke width, and so on). | w J j M d ri i gs q Q cm CS cs SC SCN sc scn G g RG rg K k |
| Path construction operators | Building lines, curves, and rectangles. | m l c v y h re |
| Path painting operators | Stroking and filling paths, or using them to define clipping regions. | S s f F f* B B* b b* n W W* |
| Other painting operators | Shading patterns and inline images. | sh BI ID EI Do |
| Text operators | Selecting and showing text in various fonts and ways. | Tc Tw Tz TL Tf Tr Ts Td TD Tm T* Tj TJ ' " d0 d1 |
| Marked-content and compatibility operators | Demarcating sections of the stream. | MP DP BMC BDC EMC BX EX |
The page is rendered by considering each operator and its operands in turn. The graphics state is maintained throughout: some operators alter it, and others consult it. Operands are often numbers, but can be names, dictionaries, or arrays.
The part of the graphics state which would be needed to render our examples, as it may appear in a typical PDF implementation, is summarized in Table 5-2.
We’re using a landscape US Letter page (width 11 inches or 792 points; height 8.5 inches or 612 points). The PDF coordinate system, by default, has the origin at the lower-left corner of the page, with x and y increasing rightward and upward, respectively.
Let’s use some path construction, stroking, and line attribute operators to build a simple graphics stream:
100 100 m 300 200 l 700 100 l   % Move to (100, 100), line to (300, 200), line to (700, 100)
S                               % Stroke the line
8 w                             % Change line width from the default (1.0) to 8.0
1 J                             % Change line cap to rounded (code 1) from the default butt cap (code 0)
100 200 m 300 300 l 700 200 l   % Define new path, same shape but 100pts higher up the page
S                               % Stroke the new line
[20] 0 d                        % Change to 20pt dashes
100 300 m 300 400 l 700 300 l   % Define new path, same shape but another 100pts higher up the page
S                               % Stroke the new line
The result is shown in Figure 5-2.
We’ve used the m operator to move to the start of the new path, and the l operator to form two lines. Note that at this point, nothing has been drawn: the page is only affected when we use the S operator to stroke the line. The S operator also clears the current path.
The w operator sets the line width in the graphics state to 8 points. The J operator sets the line endings to rounded caps. The dash pattern is set with the d operator, which takes two operands: an array (a repeating sequence of dash length, gap length, dash length, and so on, which is cycled through when stroking the line), and an initial offset (the phase) which moves the start of the pattern. In our example, there is just one entry, so dashes and gaps are both 20pt, and the phase is 0.
Line joins, dash patterns, and line caps are summarized in Table 5-3, Table 5-4, and Table 5-5, respectively.
Paths may be made from more than one subpath, each subpath starting with the m operator. This can be used to define a single path made from several discontiguous shapes.
Table 5-4. Dash patterns
| Dash pattern specification | Meaning |
|---|---|
| [] 0 | Continuous line |
| [2] 0 | 2 on, 2 off, 2 on... |
| [2] 1 | 1 on, 2 off, 2 on... (phase set to 1) |
| [2 3] 0 | 2 on, 3 off, 2 on... |
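The rows of Table 5-4 can be reproduced with a few lines of Python. This is an illustrative sketch of how a dash array and phase turn into alternating dashes and gaps along a line of a given length; the function name `dash_runs` is hypothetical, and a real renderer works on curved paths and device pixels rather than a single length.

```python
from itertools import cycle


def dash_runs(pattern, phase, total):
    """Split `total` units of line length into (is_dash, length) runs."""
    if not pattern:
        return [(True, total)]            # "[] 0": continuous line
    if sum(pattern) <= 0:
        raise ValueError("dash array must contain a positive length")
    runs, is_dash, skip, remaining = [], True, phase, total
    for length in cycle(pattern):         # the array repeats for as long as needed
        if remaining <= 0:
            break
        if skip >= length:
            skip -= length                # entry lies entirely before the line starts
        else:
            run = min(length - skip, remaining)
            runs.append((is_dash, run))
            remaining -= run
            skip = 0
        is_dash = not is_dash             # dashes and gaps alternate
    return runs


print(dash_runs([2], 1, 5))       # the "[2] 1" row of the table
print(dash_runs([2, 3], 0, 7))    # the "[2 3] 0" row of the table
```

Because the array is cycled, a one-entry array such as `[20]` yields dashes and gaps of the same length, which matches the 20pt dashes in the example stream.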
As well as straight lines, we can draw curves. There are many different possible schemes for defining curves, but the industry has settled on Bézier curves, named for the automobile engineer Pierre Bézier. They are easy and predictable to manipulate with the mouse onscreen, relatively easy to draw at any resolution or accuracy, and simple to define mathematically.
A curve is defined by four points: the start and end points, and two control points which define how the curve is shaped between start and end. The curve does not necessarily pass through the control points, but always sits fully inside the convex quadrilateral defined by its four points.
An example curve, showing the start and end points and the two control points (shown with dotted lines from the end points, as they may be represented in a graphics editor) can be seen in Figure 5-3. This was generated by using the c operator:
300 200 m 400 300 500 400 600 200 c S
We use the m operator to move the current point to the start of the curve. The c operator takes three more coordinate pairs (six operands): the first control point, second control point, and end point.
For more information on Bézier curves, consult a graphics text; see PDF and Graphics Documentation.
Interestingly, it’s not possible to draw exact circles in PDF. But we can use several Bézier curves to approximate one closely. We’ll use four symmetric curves (the minimum number to get a good result), one for each quadrant. For a specimen quadrant of the unit circle centered at (1, 0), the coordinates are shown in Figure 5-4. The number k is about 0.552.
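As a rough illustration of the four-curve approach, the Python sketch below emits the path operators for an approximate circle. It assumes the commonly used constant k = 4(√2 − 1)/3, which is about 0.5523 and makes the midpoint of each curve lie exactly on the circle; the function name `circle_path` and the two-decimal formatting are choices made for this sketch only.

```python
import math

# Control point distance for a quarter circle of radius 1 (about 0.5523).
K = 4.0 / 3.0 * (math.sqrt(2.0) - 1.0)


def circle_path(cx, cy, r):
    """Return content stream text for a stroked circle built from four Bezier curves."""
    k = K * r
    # Each tuple is (first control point, second control point, end point),
    # going counterclockwise from the rightmost point of the circle.
    segments = [
        ((cx + r, cy + k), (cx + k, cy + r), (cx, cy + r)),
        ((cx - k, cy + r), (cx - r, cy + k), (cx - r, cy)),
        ((cx - r, cy - k), (cx - k, cy - r), (cx, cy - r)),
        ((cx + k, cy - r), (cx + r, cy - k), (cx + r, cy)),
    ]
    lines = [f"{cx + r:.2f} {cy:.2f} m"]
    for segment in segments:
        numbers = " ".join(f"{v:.2f}" for point in segment for v in point)
        lines.append(numbers + " c")
    lines.append("h S")                    # close the path and stroke it
    return "\n".join(lines)


print(circle_path(400, 300, 100))
```

The output can be pasted into the content stream of the skeleton file in place of the single line.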
Paths may be filled as well as stroked, by substituting another operator from Table 5-6 for the S operator we used before (here, we use B to fill and stroke the path). Figure 5-5 shows a shape filled and stroked using the following code:
2.0 w
0.75 g                      % Change fill color to light Gray
250 250 m                   % Move to start of path
350 350 450 450 550 250 c   % First curve
450 250 350 200 y           % Second curve
h B                         % Close, then fill and stroke
We’ve used the g operator to set the fill color. This is explained in Colors and Color Spaces. For the second curve, we’ve used the y operator, which is like c except that the second control point and the end point are one and the same, so only four operands are needed.
There are two factors distinguishing fill operators from one another:
Whether the path is automatically closed before filling. Closing involves adding a straight line segment from the current point to the starting point of the current subpath. The path may be manually closed with the h operator.
The winding rule, which determines the choices made when filling an object which is self-intersecting or made up of multiple subpaths which overlap. Figure 5-6 shows the effect of the two winding rules on both a self-intersecting object, and a path made from two overlapping rectangular subpaths.
The code for Figure 5-6 is:
100 350 200 200 re 120 370 160 160 re f    % Non-zero
400 350 200 200 re 420 370 160 160 re f*   % Even-odd
150 50 m 150 250 l 250 50 l 50 150 l 350 150 l h f
550 50 m 550 250 l 650 50 l 450 150 l 750 150 l h f*
Here, we’ve also used the re operator. This creates a rectangular, closed path given four arguments: minimum x, minimum y, width, and height.
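To make the difference between the two winding rules concrete, here is a small Python sketch that tests whether a point would be painted under each rule. It uses the standard ray-crossing formulation for polygons with straight edges; the helper names (`crossings`, `rect`, `filled_nonzero`, `filled_even_odd`) are hypothetical, and curves are not handled.

```python
def crossings(point, subpaths):
    """Signed crossings of a rightward ray from `point` with every closed subpath."""
    px, py = point
    signs = []
    for poly in subpaths:
        for i in range(len(poly)):
            (x0, y0), (x1, y1) = poly[i], poly[(i + 1) % len(poly)]
            if (y0 <= py) != (y1 <= py):               # edge straddles the ray's height
                x = x0 + (py - y0) * (x1 - x0) / (y1 - y0)
                if x > px:                             # crossing lies to the right
                    signs.append(1 if y1 > y0 else -1) # upward edge +1, downward -1
    return signs


def filled_nonzero(point, subpaths):
    return sum(crossings(point, subpaths)) != 0


def filled_even_odd(point, subpaths):
    return len(crossings(point, subpaths)) % 2 == 1


def rect(x, y, w, h):
    """Corner order produced by the re operator (counterclockwise for positive w, h)."""
    return [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]


# The two overlapping rectangles from the first line of the Figure 5-6 code.
paths = [rect(100, 350, 200, 200), rect(120, 370, 160, 160)]
print(filled_nonzero((200, 450), paths), filled_even_odd((200, 450), paths))
```

Both rectangles run in the same direction, so a point inside the inner one has winding number 2: it is painted under the non-zero rule but left as a hole under the even-odd rule.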
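Table 5-6. Operators for filling and stroking paths
| Operator | Function |
|---|---|
| n | Ends the path with no visual effect. This is used to change the current clipping path (see Clipping). |
| b | Close, fill, and stroke the path (non-zero winding rule) |
| b* | Close, fill, and stroke the path (even-odd winding rule) |
| B | Fill and stroke the path (non-zero winding rule) |
| B* | Fill and stroke the path (even-odd winding rule) |
| f or F | Fill the path (non-zero winding rule) |
| f* | Fill the path (even-odd winding rule) |
| S | Stroke the path |
| s | Close and stroke the path |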
To change the fill or stroke color in a PDF graphics stream, we need to change the current color space using one operator, and then change the color using another. Fill and stroke color spaces are separate: the current fill color space could be DeviceRGB and the stroke color space DeviceGray, for example.
In this section, we look at the basic DeviceGray, DeviceRGB, and DeviceCMYK color spaces (more complicated color spaces are covered in the PDF Standard):
The DeviceGray color space has one additive component, which varies from 0.0 (Black) to 1.0 (White).
The DeviceRGB color space has three additive components for Red, Green, and Blue. They each range from 0.0 (e.g., no Red) to 1.0 (e.g., full Red).
The DeviceCMYK color space has four subtractive components for Cyan, Magenta, Yellow, and Key (Black). They each range from 0.0 (no pigment) to 1.0 (full pigment).
To change the stroke color space, we use the CS operator. To change the fill color space, use cs instead. The SC operator (with a number of operands equal to the number of components in the current color space) can then be used to set the stroke color, or sc to set the fill color. For example:
/DeviceRGB CS    % Set stroke color space
0.0 0.5 0.9 SC   % Set color to RGB (0.0, 0.5, 0.9)
There are shortcut operators for the device color spaces, which set the current stroke or fill color space and the current stroke or fill color in one operation. These are summarized in Table 5-7.
Table 5-7. Simple color and color space operators
| Operator | Operands | Function |
|---|---|---|
| G | 1 | Changes the stroke color space to /DeviceGray and sets the color |
| g | 1 | Changes the fill color space to /DeviceGray and sets the color |
| RG | 3 (R, G, B) | Changes the stroke color space to /DeviceRGB and sets the color |
| rg | 3 (R, G, B) | Changes the fill color space to /DeviceRGB and sets the color |
| K | 4 (C, M, Y, K) | Changes the stroke color space to /DeviceCMYK and sets the color |
| k | 4 (C, M, Y, K) | Changes the fill color space to /DeviceCMYK and sets the color |
When a content stream begins, the default color space is /DeviceGray and the default color value is 0 (fully black), so we can use the g operator straight away:
200 250 100 100 re f
0.25 g 300 250 100 100 re f
0.5 g 400 250 100 100 re f
0.75 g 500 250 100 100 re f
The result is shown in Figure 5-7.
So far, we’ve seen operators that alter the graphics state for all the operators that follow them. In order to allow us to group together graphics objects with their attributes (such as color), we can bracket a group of operators with the q and Q operators. The q operator puts aside the current graphics state. The state may then be altered, objects painted, and so on, as usual. When the Q operator is invoked, the previously saved state is restored. The q/Q pairs may be nested, one pair inside another:
0.75 g                 % Change to light Gray fill
250 250 100 100 re f
q                      % Save the graphics state
0.25 g                 % Change to dark Gray fill
350 250 100 100 re f
Q                      % Retrieve the previous graphics state
450 250 100 100 re f   % Light Gray again
The q/Q operators in a stream must form balanced pairs (with the exception that, at the end of a graphics stream, any remaining Q operators may be omitted). The result is shown in Figure 5-8.
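One way to picture q and Q is as push and pop on a stack of graphics states. The Python sketch below models just two state parameters to show the idea; the class name and the dictionary keys are hypothetical, and a real implementation would track the full graphics state.

```python
import copy


class GraphicsStateStack:
    """Toy model of the graphics state with q (save) and Q (restore)."""

    def __init__(self):
        # Initial values: black fill in DeviceGray, line width 1.0.
        self.current = {"fill_gray": 0.0, "line_width": 1.0}
        self.saved = []

    def q(self):
        self.saved.append(copy.deepcopy(self.current))   # put the state aside

    def Q(self):
        self.current = self.saved.pop()                   # bring the saved state back

    def g(self, gray):
        self.current["fill_gray"] = gray


state = GraphicsStateStack()
state.g(0.75)      # light Gray
state.q()
state.g(0.25)      # dark Gray, only until the matching Q
inside = state.current["fill_gray"]
state.Q()
after = state.current["fill_gray"]
print(inside, after)
```

An unbalanced Q in this model raises an IndexError from `pop`, which mirrors the requirement that the operators form balanced pairs.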
One of the most frequent uses of q/Q pairs is to isolate the effects of coordinate transforms. We can use the cm operator to change the transformation from user space coordinates to device space coordinates. This transformation is known as the Current Transformation Matrix (CTM). It’s important that this change to the graphics state is isolated by a q/Q pair, because it’s complicated to undo.
The cm operator takes six arguments, representing a matrix to be composed with the CTM. Here are the basic transforms:
Translation by (dx, dy) is specified by 1, 0, 0, 1, dx, dy
Scaling by (sx, sy) about (0, 0) is specified by sx, 0, 0, sy, 0, 0
Rotating counterclockwise by x radians about (0, 0) is specified by cos x, sin x, -sin x, cos x, 0, 0
The cm operator appends the given transform to the CTM, rather than replacing it. To rotate or scale around an arbitrary point (rather than the origin), translate to the origin, rotate or scale, and translate back.
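The translate, rotate, translate back recipe can be checked numerically. The Python sketch below represents each matrix by its six cm operands and assumes PDF’s row-vector convention, in which a point [x y 1] is multiplied on the right by the matrix; the helper names are hypothetical.

```python
import math


def mul(m1, m2):
    """Matrix product m1 x m2 for six-number matrices; m1 acts on the point first."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
            c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
            e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2)


def translate(dx, dy):
    return (1, 0, 0, 1, dx, dy)


def rotate(theta):
    c, s = math.cos(theta), math.sin(theta)
    return (c, s, -s, c, 0, 0)


def apply(m, x, y):
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


def rotate_about(theta, px, py):
    """Single matrix equivalent to: move (px, py) to the origin, rotate, move back."""
    return mul(mul(translate(-px, -py), rotate(theta)), translate(px, py))


m = rotate_about(math.pi / 2, 100, 100)
print(apply(m, 100, 100))   # the center of rotation stays where it is
print(apply(m, 200, 100))   # a point to its right ends up above it
```

The six numbers in `m` could be written as the operands of a single cm operator. Written as three separate cm operators, the sequence would be translate(px, py), rotate, translate(-px, -py), because each cm is applied in front of the existing CTM.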
Any graphics text will have a full discussion of the mathematics of such transforms. See PDF and Graphics Documentation.
Consider the following, illustrated in Figure 5-9:
2.0 w 0.75 g
100 100 m 200 200 300 300 400 100 c   % (a) Untransformed shape
300 100 200 50 y h B
q
0.96 0.25 -0.25 0.96 0 0 cm           % (b) Rotate counterclockwise by 1/4 radian
100 100 m 200 200 300 300 400 100 c 300 100 200 50 y h B
Q
q
0.5 0 0 0.5 0 0 cm                    % (c) Scale original shape by 0.5 about the origin
100 100 m 200 200 300 300 400 100 c 300 100 200 50 y h B
1 0 0 1 300 0 cm                      % (d) Translate (c) by 300 units in the new space, i.e., 150 units in the original space
100 100 m 200 200 300 300 400 100 c 300 100 200 50 y h B
Q
Note the use of q and Q to isolate the effect of transforms.
We can use a path, built in the usual way, to set the clipping path. From that point on, only content within the path’s area will be shown. This is done by using the W operator (which applies the non-zero winding rule) or the W* operator (which applies the even-odd rule).
The operator intersects the path given with the existing clipping path, so it can only be used to make the clipping region smaller, not larger. The clipping path remains the current path, so it can be used to stroke the outline of the clipping region using, for example, the S operator. The W operator is a modifier to the painting operation, so if we don’t want to stroke the outline of the new clipping path, we must substitute the no-op path painting operator n. Here’s an example where we define a clipping path:
200 100 m 200 500 l 500 100 l h W S
Here we have defined a closed triangular path, set the clipping region using W, and then stroked it using S. The result of setting this clipping path and then drawing the same scene as Figure 5-2 can be seen in Figure 5-10.
PDF has a sophisticated but complicated transparency mechanism which works in multiple color spaces, allows different types of blending, and supports grouped transparencies. We only consider simple transparency here.
There are no specific transparency operators, so we use the gs operator to load the fill transparency level from the /ca entry in the /ExtGState entry in the page’s resources. The /ExtGState entry is a dictionary of collections of external graphics state, which we can load in using the gs operator.
For our example, the resources consist of just the /ExtGState entry, with a single collection of state, called /gs1. It contains just the /ca entry for fill transparency:
<< /ExtGState
<< /gs1
<< /ca 0.5 >>   % Half transparent
>>
>>
Here is the corresponding content stream:
2.0 w       % Select 2pt line width
/gs1 gs     % Select /gs1 from external graphics state
0.75 g      % Select light Gray
200 250 m 300 350 400 450 500 250 c 400 250 300 200 y h B
1 0 0 1 100 100 cm
200 250 m 300 350 400 450 500 250 c 400 250 300 200 y h B
The result is shown in Figure 5-11. The transparency is defined so that 0 means wholly transparent, and 1 wholly opaque. The stroke transparency may be altered with /CA in place of (or in addition to) /ca.
As well as plain colors, PDF allows various patterns to be used to fill and stroke objects:
Tiling patterns, where a pattern cell is replicated over the page.
Shading patterns, where a gradient between colors is used to fill an object. There are many types, with many options and settings:
| Function-based |
| Axial |
| Radial |
| Free-form Gouraud-shaded triangle mesh |
| Lattice-form Gouraud-shaded triangle mesh |
| Coons patch mesh |
| Tensor-product patch mesh |
We consider just Axial and Radial shadings.
Patterns are invoked by changing to the /Pattern color space using the cs operator, then using the scn operator to select a named pattern. Patterns are listed by name in the /Pattern dictionary in the page’s resources. For example:
/Pattern
<<
/GradientShading             % Our name for the pattern
<<
/Type /Pattern
/PatternType 2               % A shading pattern
/Shading
<<
/ColorSpace /DeviceGray
/ShadingType 2               % A linear shading
/Function << /FunctionType 2 /N 1 /Domain [0 1] >>
/Coords [150 200 450 500]    % Coordinates of start and end of gradient
/Extend [true true]
>>
>>
>>
This defines an axial shading pattern. We have named our pattern /GradientShading. The pattern type for shadings is 2. Our shading is defined by:
The color space /DeviceGray
The shading type 2 (Axial)
The coordinates of the start and end of the shading: (150, 200) and (450, 500)
We don’t discuss the /Extend or /Function entries here. The pattern is now invoked, and a shape drawn:
/Pattern cs             % Choose pattern color space for fills
/GradientShading scn    % Choose our pattern as a color
250 300 m 350 400 450 500 550 300 c 450 300 350 250 y h f
The result is Figure 5-12.
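For readers curious about what the axial shading computes at each point, here is a hedged Python sketch. It projects the point onto the axis from (150, 200) to (450, 500) to get a parameter t between 0 and 1, and it assumes that the type 2 function with /N 1 and no /C0 or /C1 entries falls back to the defaults 0 and 1, so the gray level is simply t. The function name `axial_gray` is hypothetical.

```python
def axial_gray(x, y, coords=(150, 200, 450, 500)):
    """Gray level (0 = black, 1 = white) of the example axial shading at (x, y)."""
    x0, y0, x1, y1 = coords
    dx, dy = x1 - x0, y1 - y0
    # Position of the point along the axis: 0 at the start, 1 at the end.
    t = ((x - x0) * dx + (y - y0) * dy) / (dx * dx + dy * dy)
    # /Extend [true true]: beyond either end, the end color simply continues.
    t = min(1.0, max(0.0, t))
    # Assumed: /FunctionType 2 with /N 1 and default /C0 0, /C1 1 maps t to t.
    return t


print(axial_gray(150, 200), axial_gray(300, 350), axial_gray(450, 500))
```

All points on a line perpendicular to the axis share the same t, which is why the gradient appears as parallel bands.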
If we change to a radial shading by changing the /ShadingType to 3, and change the /Coords entry to [400 400 0 400 400 200], we get a radial shading with inner radius 0 and outer radius 200, both centered on (400, 400):
/Coords [400 400 0 400 400 200]
/ShadingType 3
The result is shown in Figure 5-13.
In Transformations, we used the q and Q operators to display a single object using various transformations. However, we had to repeat the operations for drawing the object each time. A Form XObject allows us to store a set of graphics instructions, and use them repeatedly (even on different pages), at differing scales and positions.
Form XObjects have nothing to do with PDF forms (the kind you fill in).
% Resources of current page
3 0 obj
<< /XObject << /X1 5 0 R >>   % Our XObject is called /X1
>>
endobj
% The XObject itself
5 0 obj
% The XObject dictionary
<< /Type /XObject /Subtype /Form /Length 69 /BBox [0 0 792 612] >>
% The XObject content
stream
2.0 w 0.5 g 250 300 m 350 400 450 500 500 300 c 450 300 350 250 y h B
endstream
endobj
Object 3 in the listing above is the page’s /Resources entry. Its /XObject entry is a dictionary listing the XObjects used in that page. We’ve named our XObject /X1. Object 5 is the XObject itself. It’s a stream, with the following entries in its dictionary:
The /Type of this object is /XObject.
The /Subtype of this XObject is /Form, distinguishing it as a form XObject.
The /Length is the length in bytes of the stream, as usual.
The /BBox entry gives a bounding box for the XObject, in this case the same as the page itself.
The stream contains the code for setting up the line width and fill color, and the shape itself. Now, we can use the XObject from the main content stream, using the Do operator with the XObject’s name as the operand:
/X1 Do               % Invoke XObject /X1
0.5 0 0 0.5 0 0 cm   % Scale by 0.5 about the origin
/X1 Do               % Invoke the XObject again, at the new scale
The result is shown in Figure 5-14.
When the Do operator is encountered, the current graphics state is saved, the /Matrix entry (if any) from the XObject is concatenated with the CTM, the content is drawn (clipped by the XObject’s /BBox), and the current graphics state is restored.
Images are specified using separate objects, again stored in the /XObject entry in the page’s resources dictionary. They are thus separate from the graphics content stream, and so may be reused multiple times, even across pages. To specify an image, we provide the image data (usually compressed using one of many mechanisms such as JPEG), its width and height, and some parameters which describe the conversion from the image data to values in its color space.
Here is a resources entry for an image XObject:
<< /XObject << /X2 5 0 R >> >>
This defines an image XObject called /X2 whose parameters are:
5 0 obj
<< /Type /XObject          % It's an XObject
/Subtype /Image            % It's an image
/ColorSpace /DeviceGray    % The color space of the image. Also determines how many components it has.
/Length 8                  % The length of the stream in bytes, as usual
/Width 8                   % Image width in pixels
/Height 8                  % Image height in pixels
/BitsPerComponent 1        % Number of bits used for each component
>>
% The image data
stream
@`pxxp`@
endstream
endobj
To make this possible to type in manually, we’ve defined a one-bit-per-pixel black and white image, containing just 64 bits of data. Typically, images would be hundreds or thousands of pixels in each direction and with up to 16 bits per component, with one, three, or four components.
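To see what those eight bytes draw, the stream data can be unpacked in a few lines of Python. The sketch assumes the usual layout for one-bit images (rows stored top to bottom, most significant bit first, each row padded to a whole byte) and the default DeviceGray mapping in which 0 is black and 1 is white; the function name `decode_1bit` is hypothetical.

```python
def decode_1bit(data: bytes, width: int, height: int):
    """Return the image as a list of rows, each a string of '0' and '1' samples."""
    row_bytes = (width + 7) // 8            # each row is padded to a byte boundary
    rows = []
    for r in range(height):
        chunk = data[r * row_bytes:(r + 1) * row_bytes]
        bits = "".join(format(byte, "08b") for byte in chunk)
        rows.append(bits[:width])
    return rows


rows = decode_1bit(b"@`pxxp`@", 8, 8)
for row in rows:
    # Print white samples (1) as '.' and black samples (0) as '#'.
    print(row.replace("1", ".").replace("0", "#"))
```

The white samples form a small right-pointing triangle on a black background.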
Images always map to the square (0,0)...(1,1) in user space, so we use cm operators to scale the image to the appropriate size and position:
q
1 0 0 1 100 100 cm   % Translate
200 0 0 200 0 0 cm   % Scale
/X2 Do               % Invoke image XObject
Q
q
1 0 0 1 400 100 cm   % And again with a different position and scale
100 0 0 100 0 0 cm
/X2 Do
Q
The result is shown in Figure 5-15.
In the previous chapter, we saw how a series of graphics operators can be used to draw content on a page, by reference to their operands and a stack-based graphics state.
In this chapter, we look at the operators and state for selecting characters from fonts and printing them on the page. Then, we see how fonts and their metrics are defined and embedded in PDF documents. Finally, we discuss the complex task of general-purpose text extraction from a document.
It would be possible to define a page description language where none of the text layout had been performed, and plain text was supplied along with boxes and columns to be filled on the fly, just like a desktop publishing package. Conversely, it would be possible to define a page description language without fonts or text as such at all, just relying on text being converted to outline shapes as the document is produced, having been laid out in, for example, a word processor.
PDF adopts a middle ground: the ideas of a font and of small-scale text layout are retained, but the large-scale paragraph layout must be done in advance. This has the following advantages:
Complete control over layout, because large-scale layout (paragraphs, line breaks) is the job of the program producing the PDF. The document will look as it is supposed to.
Predictable small-scale text layout, such as strings with fixed character spacing, is supported, so the position of each character need not be explicitly stated.
Space is saved by the use of fonts as libraries of character shapes, and the simple inclusion of existing font files minimizes compatibility and portability problems.
Original characters and some layout elements are maintained, so copy-and-paste and text extraction are normally possible.
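The text state parameters and the operators which modify them are summarized in Table 6-1.
Table 6-1. Text state parameters and their operators
| Parameter | Description | Operand | Operator | Initial value |
|---|---|---|---|---|
| T_c | Character spacing | charSpace | Tc sets the character spacing to charSpace, expressed in unscaled text space units. | 0 |
| T_w | Word spacing | wordSpace | Tw sets the word spacing to wordSpace, expressed in unscaled text space units. | 0 |
| T_h | Horizontal scaling | scale | Tz sets the horizontal scaling to (scale / 100). | 100 (normal spacing) |
| T_l | Leading | leading | TL sets the text leading to leading, expressed in unscaled text space units. | 0 |
| T_f, T_fs | Font, font size | font size | Tf selects font font at size points. | None. Must be specified. |
| T_mode | Rendering mode | render | Tr sets the text rendering mode to render, an integer. | 0 |
| T_rise | Rise | rise | Ts sets the text rise to rise, expressed in unscaled text space units. | 0 |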
We discuss the phrase “unscaled text space units” in Text Space and Text Positioning. The text state is stored along with the graphics state, and manipulated using the operators above. The current text state is affected by the stack operators q and Q, just like the graphics state.
Printing text on the page requires:
Selecting a font.
Choosing position, size, and orientation.
Choosing spacing, color, text rendering mode, and other parameters.
Selecting characters from the font, and showing them on the page.
The operators BT (begin text) and ET (end text) form brackets around text sections. Operators for showing text in a page’s content stream may only appear between BT and ET. Operators for altering text state, however, are not restricted in this way. Text sections may also contain other operators altering the general graphics state.
As an example, we return to the “Hello, World!” file from Chapter 2:
1. 0. 0. 1. 50. 700. cm   % Position at (50, 700)
BT                        % Begin text block
/F0 36. Tf                % Select /F0 font at 36pt
(Hello, World!) Tj        % Place the text string
ET                        % End text block
Here, we’ve used the Tf operator with font name and size operands to select the font, and the Tj operator to show a text string. We have relied on the graphics operator cm to position the text. Now, we will discuss other methods of changing the text position.
Text space is the coordinate system in which text is defined. The transformation from this text space into user space (and then into device space, as usual) determines where text is placed on the page. The origin of the first glyph in the text string is placed at the origin of text space.
There are two matrices to consider:
The text matrix, which defines the current transformation for the next glyph. It is altered by the text positioning and text showing operators.
The text line matrix, which is the state of the text matrix at the beginning of the current line. Thus, lines of text may be aligned vertically by the use of an operator to move to the next line, without manually keeping track of the position of the start of the line.
These matrices do not persist from text section to text section, but are reset to the identity matrix at the beginning of each text section. Together with the font size, horizontal scaling, and text rise, these two matrices define the transformation from text space to user space.
表 6-2总结了用于修改文本位置的运算符 。
The operators for modifying the text position are summarized in Table 6-2.
表 6-2。定位文本的运算符
Table 6-2. Operators for positioning text
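| Operands | Operator | Function |
|---|---|---|
| x, y | Td | Moves the text position to the next line, offset by (x, y). The parameters are expressed in unscaled text space units. |
| x, y | TD | Moves the text position to the next line, offset by (x, y). Sets the leading to -y. The parameters are expressed in unscaled text space units. |
| - | T* | Moves the text position to the next line. Equivalent to the sequence 0 -leading Td (where leading is the current text leading). |
| a, b, c, d, e, f | Tm | Sets the text matrix and the text line matrix to [a b 0 c d 0 e f 1]. Unlike the graphics matrix operator cm, the matrix replaces the current matrix rather than being concatenated with it. |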
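To make the bookkeeping behind Td, TD, T*, and Tm concrete, here is a minimal Python sketch of the two matrices. It is an illustration only: the class name TextState and its method names are our own, and it tracks just the translation part of each matrix, so it assumes the text line matrix carries no scaling or rotation.

```python
class TextState:
    """Tracks only the translation part of the text matrix and text line matrix."""

    def __init__(self):
        # BT resets both matrices to the identity, so both start at (0, 0).
        self.tm = (0.0, 0.0)    # text matrix translation
        self.tlm = (0.0, 0.0)   # text line matrix translation
        self.leading = 0.0

    def Tm(self, e, f):
        # Tm replaces both matrices; only the e, f translation is modelled here.
        self.tm = self.tlm = (float(e), float(f))

    def TL(self, leading):
        self.leading = float(leading)

    def Td(self, tx, ty):
        # Offsets are relative to the start of the current line, not to the
        # position reached after showing text.
        x, y = self.tlm
        self.tm = self.tlm = (x + tx, y + ty)

    def TD(self, tx, ty):
        # Same as Td, but also sets the leading to -ty.
        self.leading = -ty
        self.Td(tx, ty)

    def T_star(self):
        # T* behaves like "0 -leading Td".
        self.Td(0, -self.leading)


state = TextState()
state.Tm(120, 350)
state.TL(50)
state.T_star()
print(state.tm)  # (120.0, 300.0)
```

Showing text with Tj would advance the text matrix but leave the text line matrix alone, which is why T* can return to the left margin without any manual tracking. That part is not modelled in this sketch.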
运算符在Tj当前位置显示文本。这与我们已经看到的文本定位运算符结合就足够了。但是,为了方便和简洁,提供了三个额外的运算符('、''和
TJ)。这些是文本显示和文本定位的常见组合的快捷方式。表 6-3总结了显示运算符的文本。
The Tj operator shows text at the current position. This, in combination with the text positioning operators we have already seen, would suffice. However, for convenience and brevity, three additional operators (', ", and TJ) are provided. These are shortcuts for common combinations of text showing and text positioning. The text showing operators are summarized in Table 6-3.
表 6-3。显示文本的运算符
Table 6-3. Operators for showing text
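| Operands | Operator | Function |
|---|---|---|
| string | Tj | Shows the string at the current position. |
| string | ' | Moves to the next line, taking account of the leading and the text matrix, and shows the string at the new position. The same as using T* followed by Tj. |
| wordspace, charspace, string | " | Sets the word spacing to wordspace and the character spacing to charspace. Moves to the next line, taking account of the leading and the text matrix, and shows the string at the new position. The same as the sequence wordspace Tw charspace Tc string '. |
| array | TJ | Shows text strings with adjustments to individual glyph positions (for example, kerning). The array contains strings and numbers in any combination. String entries are shown as normal; a number entry adjusts the text matrix horizontally by subtracting that amount (expressed in thousandths of a text space unit). |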
我们现在将通过一些显示文本的示例,为简单起见,使用标准字体和基于 Latin-1 的 PDFDocEncoding。与往常一样,这些示例可以在在线资源中找到。
We will now go through some examples of showing text, using the standard font and the Latin-1 based PDFDocEncoding for simplicity. As always, these examples can be found in the online resources.
这是我们的第一个示例,其中我们使用各种运算符显示了一些文本行。结果如图 6-1 所示:
Here is our first example, where we show some lines of text using various operators. The result is illustrated in Figure 6-1:
BT
  /F0 36 Tf
  1 0 0 1 120 350 Tm
  50 TL
  (Character and Word Spacing) Tj
  T*
  3 Tc
  (Character and Word Spacing) Tj
  T*
  10 Tw
  (Character and Word Spacing) Tj
ET
在这个例子中,我们有:
In this example we have:
用于Tf选择
/F036 点的字体。
Used Tf to select font
/F0 at 36 points.
用于Tm将文本位置设置为 (120, 350)。
Used Tm to set the text
position to (120, 350).
用于TL将领先设置为 50 点。
Used TL to set the
leading to 50 points.
显示带有Tj, 的字符串,用于T*移动到下一行。
Shown a string with Tj,
and used T* to move to the next
line.
将字符间距设置为 3 点,并再次绘制字符串。
Set the character spacing to 3 points, and drawn the string again.
将字间距设置为 10 点,第三次绘制字符串。
Set the word spacing to 10 points, and drawn the string a third time.
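The effect of Tc and Tw on each glyph can be sketched in a few lines of Python. The function below follows the horizontal advance formula from the PDF standard as we understand it; the function name, parameter names, and the glyph widths used in the calls are illustrative assumptions, not values taken from the font in this example.

```python
def glyph_advance(width_1000, font_size, char_space=0.0, word_space=0.0,
                  is_space=False, h_scale=1.0):
    """Horizontal advance after showing one glyph, in unscaled text space units.

    width_1000 is the glyph width in thousandths of a text space unit, as
    stored in a font's /Widths array. Word spacing is added only for the
    single-byte space character (code 32).
    """
    extra_word = word_space if is_space else 0.0
    return (width_1000 / 1000.0 * font_size + char_space + extra_word) * h_scale


# A hypothetical 500-unit-wide letter at 36 points, first with no extra
# spacing and then with 3 points of character spacing.
print(glyph_advance(500, 36))                 # 18.0
print(glyph_advance(500, 36, char_space=3))   # 21.0

# A hypothetical 250-unit-wide space with both settings active, as on the
# third line of the example (the earlier 3 Tc is still in effect there).
print(glyph_advance(250, 36, char_space=3, word_space=10, is_space=True))  # 22.0
```

Character spacing is added after every glyph, whereas word spacing only widens the spaces between words.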
在此示例中,我们展示了文本转换如何与图形转换相结合以确保文本定位操作(例如,移动到下一行)正常工作,即使整个文本部分都已转换。结果如图 6-2 所示:
In this example, we show how text transforms combine with graphics transforms to make sure that text positioning operations (for example, moving to the next line) work properly, even when the whole text section is transformed. The result is Figure 6-2:
0.96 0.25 -0.25 0.96 0 0 cm
BT
  /F0 48 Tf
  48 TL
  1 0 0 1 270 240 Tm
  (Text and graphics) Tj
  T*
  (transforms combined) Tj
  T*
  (with newlines) Tj
ET
在这里,我们有:
Here, we have:
用 设置图形矩阵绕原点逆时针旋转cm。
Set up the graphics matrix to rotate anticlockwise around
the origin with cm.
Tf选择一种字体并使用和设置行距TL。
Selected a font and set the leading with Tf and TL.
将文本矩阵设置为偏移 (270, 240) 的起始位置
Tm。
Set the text matrix to offset the start by (270, 240) with
Tm.
Tj用和写三行T*。
Written three lines with Tj and T*.
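To see where the first line ends up on the page, we can apply the graphics matrix to the text position by hand. The Python sketch below uses our own helper names (rotation and apply); it assumes the usual PDF convention in which a point (x, y) is transformed by the six numbers a b c d e f as x' = ax + cy + e and y' = bx + dy + f.

```python
import math


def rotation(theta_degrees):
    """Matrix operands (a, b, c, d, e, f) for an anticlockwise rotation about the origin."""
    t = math.radians(theta_degrees)
    return (math.cos(t), math.sin(t), -math.sin(t), math.cos(t), 0.0, 0.0)


def apply(matrix, x, y):
    """Transform the point (x, y) by a PDF-style six-number matrix."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


# The matrix from the example is roughly a rotation of 14 to 15 degrees.
ctm = (0.96, 0.25, -0.25, 0.96, 0.0, 0.0)
print(math.degrees(math.atan2(ctm[1], ctm[0])))

# The Tm offset (270, 240) is itself transformed by the graphics matrix.
print(apply(ctm, 270, 240))
```

The second result, approximately (199.2, 297.9), is the position in the untransformed page coordinates at which the first line starts. Each T* then moves down the rotated page, which is why the lines stay parallel.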
Ts运算符可用于调整文本的垂直位置:
The Ts operator can be used
to adjust the vertical position of text:
BT
  /F0 72 Tf
  1 0 0 1 140 290 Tm
  (Text) Tj
  20 Ts
  (Up) Tj
  0 Ts
  (and) Tj
  -20 Ts
  (Down) Tj
ET
结果如图6-3所示。这是我们第一次在Tj
不换行的情况下使用多个运算符。请注意,Tj操作员在显示文本后将文本位置设置为刚刚绘制的字符串的末尾。
The result is shown in Figure 6-3. This is the
first time we’ve used multiple Tj
operators without starting a new line. Note that the Tj operator, having shown the text, sets the
text position to the end of the string which was just drawn.
该运算符是用于绘制具有水平字形调整的字符串TJ的替代方法。Tj这些通常发生在文本在文字处理器或排字机中进行布局时,尤其是在内容完全对齐的情况下。运算符是一种方便的TJ
方式来编码此信息,而无需为每行文本使用数十个运算符:
The TJ operator is an alternative to Tj for drawing a string with horizontal glyph adjustments. These typically occur when text is laid out in a word processor or typesetter, especially if the content is fully justified. The TJ operator is a convenient way to encode this information without using dozens of operators for each line of text:
BT
  /F0 72 Tf
  90 TL
  1 0 0 1 240 330 Tm
  [(PJ WAYNE)] TJ
  T*
  [(P)150(J )(W)150(A)80(YN)20(E)] TJ
ET
我们在TJ这里使用了两次;一次正常显示文本,第二次在传递给TJ. 结果如图 6-4 所示。
We have used TJ twice here;
once to show the text as normal, and a second time including manual
kerns in the array passed to TJ.
The result is illustrated in Figure 6-4.
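The numbers in a TJ array are in thousandths of a text space unit, so their effect depends on the font size. As a rough sketch (the function name tj_shift is our own, and horizontal scaling is assumed to be left at its default), the displacement caused by each number can be computed like this:

```python
def tj_shift(adjustment, font_size, h_scale=1.0):
    """Horizontal displacement, in unscaled text space units, caused by one
    number in a TJ array. Positive numbers move the next glyph to the left."""
    return -adjustment / 1000.0 * font_size * h_scale


# The kerns used in the example above, at 72 points.
for kern in (150, 80, 20):
    print(kern, tj_shift(kern, 72))
```

At 72 points, the 150 between P and J pulls the J back by about 10.8 points, while the 20 before the final E moves it by only about 1.4 points.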
文本有七种渲染模式,由Tr操作符设置。其中四个用于将文本设置为剪切路径,一个用于编写不可见文本。我们这里不考虑这些。其他三个(模式 0、1 和 2)分别用于填充、描边和先填充后描边。颜色设置方式与形状绘制相同:
There are eight rendering modes for text, set with the Tr operator. Four of them are for setting up text as a clipping path, and one is for writing invisible text. We don’t consider those here. The other three (modes 0, 1, and 2) are used for filling, stroking, and filling-followed-by-stroking respectively. The colors are set in the same way as for shape drawing:
0.5 g
BT
  /F0 72 Tf
  1 0 0 1 160 380 Tm
  90 TL
  (Text Mode Zero) Tj
  T*
  1 Tr
  (Text Mode One) Tj
  T*
  2 Tr
  (Text Mode Two) Tj
ET
结果如图 6-5 所示。
The result is illustrated in Figure 6-5.
字体是 特定 字符集的字形(字符形状)的集合。在 PDF 中,字体由 定义规格、字符集和编码(将文本字符串中的字符代码映射到字体中的字符)的字体字典以及字体程序(即实际的字体文件)组成,多种格式(Type 1、TrueType 等)。
A font is a collection of glyphs (character shapes) for a particular character set. In PDF, a font is composed of a font dictionary which defines the metrics, character set, and encoding (mapping of character codes in text strings to characters in the font), together with the font program (which is the actual font file), in a variety of formats (Type 1, TrueType etc).
PDF 允许使用主要的流行字体格式,以及Type 3 字体,通过使用 PDF 图形运算符的集合直接定义字符形状,允许对任何其他字体类型(例如,传统位图字体)进行编码。
PDF allows the use of the major popular font formats, together with Type 3 fonts which allow the encoding of any other font type (for example, legacy bitmap fonts) by defining the character shapes directly using a collection of PDF graphics operators.
/Type1在字体词典中与字体类型一起引入。Type 1 是一种最初用于 PostScript 的 Adobe 字体格式。标准的 14 种字体被定义为 Type 1 字体。Multiple Master Type 1 字体 ( /MMType1) 是 Type 1 的扩展,允许从一组轮廓自动生成多种字体样式。
Introduced with font type /Type1 in the font dictionary. Type 1 is an Adobe font format originally for use with PostScript. The standard 14 fonts are defined as Type 1 fonts. Multiple Master Type 1 fonts (/MMType1) are an extension of Type 1 allowing the automatic generation of many font styles from one set of outlines.
/TrueType在字体词典中与字体类型一起引入。基于 Apple 的 TrueType 字体格式(在 Microsoft Windows 中也经常使用)。
Introduced with font type /TrueType in the font dictionary. Based
on Apple’s TrueType font format (also frequently used in Microsoft
Windows).
引入字体类型/Type3。这些是由 PDF 图形运算符流组成的字体。这意味着它们可以包括颜色和阴影,因此更加灵活,但没有提示机制以在小尺寸下清晰显示。通常用于模拟其他字体格式(例如,位图字体)。
Introduced with font type /Type3. These are fonts composed of
streams of PDF graphics operators. This means they can include
colors and shadings, so are more flexible, but have no hinting
mechanisms for clear display at small sizes. Often used to emulate
other font formats (for example, bitmap fonts).
这些是复合字体,旨在支持多字节字符集(其中一种字体具有大量字形,例如中文)。本文不讨论它们。
Type 0 fonts are composite fonts, intended to support multibyte character sets (where a font has a huge number of glyphs, such as Chinese). They are not discussed in this text.
我们将使用 Type 1 字体作为示例。表 6-4总结了 Type 1 字体字典中的条目。
We will use Type 1 fonts as an example. Table 6-4 summarizes the entries in a Type 1 font dictionary.
表 6-4。Type 1字体词典(*表示必填项,**表示除标准14种字体外均需填写)
Table 6-4. Type 1 font dictionary (*denotes required entry, **denotes required except for the standard 14 fonts)
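| Key | Value type | Value |
|---|---|---|
| /Type* | name | Must be /Font. |
| /Subtype* | name | Must be /Type1. |
| /BaseFont* | name | The PostScript name of the font. |
| /FirstChar** | integer | The first character code in the /Widths array. |
| /LastChar** | integer | The last character code in the /Widths array. |
| /Widths** | array of integers | An array of length (/LastChar - /FirstChar + 1) giving the glyph widths for those characters, in thousandths of a text space unit. |
| /FontDescriptor** | indirect reference to dictionary | A font descriptor dictionary giving the metrics of the font (other than the glyph widths). |
| /Encoding | name or dictionary | The character encoding of the font, for example /MacRomanEncoding or /WinAnsiEncoding. More complicated encodings are described by a dictionary. |
| /ToUnicode | stream | A stream containing instructions for extracting text content. See the discussion of extracting text from a document. |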
PDF 中有 14 种标准的 Type 1 字体。这些字体的度量和轮廓(或合适的替代字体)必须在任何 PDF 应用程序中可用。然而,如今 Adobe 建议完全嵌入所有字体,甚至包括这些字体。标准字体是:
There are 14 standard Type 1 fonts in PDF. These are fonts where the metrics and outlines (or suitable substitution fonts) must be available in any PDF application. Nowadays, however, Adobe recommends that all fonts are fully embedded, even these. The standard fonts are:
| Times-Roman |
| Times-Bold |
| Times-Italic |
| Times-BoldItalic |
| Helvetica |
| Helvetica-Bold |
| Helvetica-Oblique |
| Helvetica-BoldOblique |
| Courier |
| Courier-Bold |
| Courier-Oblique |
| Courier-BoldOblique |
| Symbol |
| ZapfDingbats |
例如,这是一个简单的 Type 1 字体:
For example, here is a simple Type 1 font:
1 0 obj
<<
  /Type /Font
  /Subtype /Type1
  /BaseFont /Times-Roman
  /FirstChar 0
  /LastChar 255
  /Widths [ 255 255 255 255 ... 744 268 380 380 380 380 380 380 380 380 380 380 ]
  /FontDescriptor 2 0 R
  /Encoding /WinAnsiEncoding
>>
省略号...是我们省略的内容,不是PDF语言的一部分。我们稍后讨论/FontDescriptor和/Encoding条目。该/Widths数组给出了该字体中 256 个字符中每个字符的文本空间单位的千分之一宽度。
The ellipsis ... is content we
have omitted, not part of the PDF language. We discuss the /FontDescriptor and /Encoding entries later. The /Widths array gives the widths in thousandths
of a text space unit for each of the 256 characters in this font.
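To show how /FirstChar and /Widths work together, here is a small Python sketch that measures a string. The function name, its parameters, and the missing_width fallback are our own choices for illustration; character and word spacing are ignored.

```python
def string_width(codes, first_char, widths, font_size, missing_width=0):
    """Width of a string of character codes at the given font size.

    codes is a bytes object (one byte per character code), first_char is the
    font's /FirstChar, and widths is its /Widths array, whose entries are in
    thousandths of a text space unit.
    """
    total = 0
    for code in codes:
        index = code - first_char      # /Widths starts at /FirstChar
        if 0 <= index < len(widths):
            total += widths[index]
        else:
            total += missing_width     # code falls outside the array
    return total / 1000.0 * font_size


# A made-up fixed-width font covering codes 32 to 126, 600 units per glyph.
widths = [600] * (126 - 32 + 1)
print(string_width(b"Hello", 32, widths, 10))  # 30.0
```

A PDF viewer needs this calculation to know where the text position lies after a Tj operator has shown a string.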
字体编码描述了字符代码(内容流中使用的字符串中的字符)和字体中的字形描述之间的映射。字体程序有自己的内置编码,但 PDF 字体可以更改编码以使用带有 Microsoft Windows 编码的 Macintosh 字体,或使用单字节编码从超过 256 个字符的字体中选择最多 256 个字符字形(例如,字符或连字的变体)。
The font encoding describes the mapping between character codes (characters in the strings used in content streams) and glyph descriptions in the font. Font programs have their own built-in encodings, but the PDF font can alter the encoding to use a Macintosh font with a Microsoft Windows encoding, or to use a single-byte encoding to select up to 256 characters from a font with more than 256 glyphs (e.g., variations on characters or ligatures).
最简单的/Encoding条目只是一种标准编码的名称,它在 PDF 标准附录 D 中定义。更复杂的编码是通过使用字典而不是编码名称来定义的。表 6-5总结了该词典中的条目。
The simplest /Encoding entry is
just the name of one of the standard encodings, which are defined in the
PDF Standard, Appendix D. More complicated encodings are defined by
using a dictionary instead of a name for the encoding. The entries in
this dictionary are summarized in Table 6-5.
表 6-5。编码字典中的条目
Table 6-5. Entries in an encoding dictionary
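| Key | Value type | Value |
|---|---|---|
| /Type | name | Must be /Encoding. |
| /BaseEncoding | name | The base encoding, from which the /Differences entry defines differences. This is one of the predefined encodings /MacRomanEncoding, /MacExpertEncoding, or /WinAnsiEncoding. If this entry is absent, the differences are from the built-in encoding of the font file. |
| /Differences | array of integers and names | Defines the differences from the base encoding. Contains zero or more sections, each starting with a number n, followed by the glyph names for characters n, n+1, n+2, and so on. For example, [6 /endash /emdash 34 /space] maps 6 to /endash, 7 to /emdash, and 34 to /space. |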
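The way a /Differences array is read can be sketched in Python. This is only an illustration: it assumes the array has already been parsed into a Python list in which integers are character codes and strings are glyph names, and the function name is our own.

```python
def parse_differences(differences):
    """Expand a /Differences array into a {character code: glyph name} mapping."""
    mapping = {}
    code = None
    for entry in differences:
        if isinstance(entry, int):
            # A number starts a new section at that character code.
            code = entry
        else:
            if code is None:
                raise ValueError("glyph name appears before any starting code")
            mapping[code] = entry
            code += 1   # following names apply to consecutive codes
    return mapping


print(parse_differences([6, "/endash", "/emdash", 34, "/space"]))
# {6: '/endash', 7: '/emdash', 34: '/space'}
```

Codes not mentioned in the array keep whatever glyph the base encoding assigns to them.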
在示例 6-1中,字体的编码定义了与内置字体编码的不同之处,即用字符/bullet
(项目符号点)替换字符 1。这意味着 PDF 查看器可以正确地剪切和粘贴文本,因为它现在知道字符代码 1 是一个项目符号点(类似的名称/bullet在Adobe Glyph List中预定义)。它对 PDF 的显示没有影响。
In Example 6-1, the font has an encoding that
defines a difference from the built-in font encoding by replacing
character 1 by the character /bullet
(the bullet point). This means that the PDF viewer can cut and paste the
text properly, because it now knows that character code 1 is a bullet
point (names like /bullet are
predefined in the Adobe Glyph List). It makes no
difference to the display of the PDF.
示例 6-1。添加了项目符号点的字体的字体编码
Example 6-1. A font encoding for a font with the bullet point added
25 0 obj
<<
  /Type /Font
  /Subtype /Type1
  /Encoding 23 0 R                 % Reference to the encoding dictionary.
  /BaseFont /Symbol
  /ToUnicode 24 0 R                % Instructions for conversion to Unicode.
>>
endobj
23 0 obj                           % Encoding dictionary
<<
  /Type /Encoding
  /BaseEncoding /WinAnsiEncoding   % The base encoding.
  /Differences [ 1 /bullet ]       % The differences
>>
endobj
创建 PDF 文件时,必须 嵌入字体,以便显示 PDF 或以其他方式处理它的程序可以使用字形描述和编码。要嵌入字体:
When creating a PDF file, the fonts must be embedded, so that the glyph descriptions and encodings are available to the program showing the PDF or processing it in other ways. To embed a font:
从字体文件中提取各种细节——这个过程因所讨论的字体格式而异。这些详细信息(度量、编码等)用于填写字体字典、字体度量和字体编码字典。
Various details from the font file are extracted—a process that varies depending upon the font format in question. These details (metrics, encodings etc.) are used to fill out a font dictionary, the font metrics, and the font encoding dictionary.
如果字体格式允许,现在可以从有问题的字体文件中删除这些细节,只留下字形描述——所有这些信息现在都在字体字典中。这会减小嵌入字体的大小。
These details can now be stripped from the font file in question, if that is allowed by the font format, leaving just the glyph descriptions—all this information is now in the font dictionary. This reduces the size of the embedded font.
字体可以被子集化,删除整个字形描述,将字体文件减少到只包含实际使用的字符的文件。例如,仅用于文档标题的字体实际上可能只使用十个字符。根据字体格式,可能必须更改编码以将所有这些字符放在字体中的前几个字符位置,因此它们的编号为 1、2、3……。子集字体可以通过由六个大写字母后跟一个 组成的前缀来标识+,例如RTFGRF+. 在创建子集时会生成此唯一代码,以允许将不同的子集彼此区分开来。
The font may be subsetted, removing whole
glyph descriptions, reducing the font file to one which holds only
the characters which are actually used. For example, a font only
used for the title of a document may only actually use ten
characters. Depending on the font format, the encoding may have to
be altered to place all these characters in the first few character
positions in the font so they are numbered 1,2,3…. Subset fonts may
be identified by a prefix formed of six uppercase letters followed
by a +, such as RTFGRF+. This unique code is generated
when the subset is created to allow different subsets to be
distinguished from one another.
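A program reading a PDF file can recognize a subset font from its name alone. The following Python sketch (the function name and regular expression are our own) splits a /BaseFont name into the six-letter tag and the underlying font name:

```python
import re

# Six uppercase letters, a plus sign, then the rest of the font name.
SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_tag(base_font):
    """Return (tag, font_name); tag is None if the font is not a subset."""
    name = base_font.lstrip("/")
    match = SUBSET_TAG.match(name)
    if match:
        return match.group(1), match.group(2)
    return None, name


print(split_subset_tag("/GCCBBY+TT8Et00"))  # ('GCCBBY', 'TT8Et00')
print(split_subset_tag("/Times-Roman"))     # (None, 'Times-Roman')
```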
例 6-2给出了一个嵌入字体的例子。
An example of an embedded font is given in Example 6-2.
示例 6-2。嵌入字体,包括编码和字体描述符
Example 6-2. An embedded font, including encoding and font descriptor
9 0 obj
<<
  /Type /Font
  /Subtype /TrueType              % It's a TrueType font
  /BaseFont /GCCBBY+TT8Et00       % Font is TT8Et00. GCCBBY+ prefix identifies as a subset font.
  /FontDescriptor 8 0 R
  /FirstChar 1                    % There are 41 characters in this font.
  /LastChar 41
  % The widths. It's a fixed-width font.
  /Widths [603 603 603 603 603 603 603 603 603 603 603 603 603 603
           603 603 603 603 603 603 603 603 603 603 603 603 603 603
           603 603 603 603 603 603 603 603 603 603 603 603 603]
  /Encoding 14 0 R
>>
14 0 obj                          % The font encoding.
<<
  /Type /Encoding
  /BaseEncoding /WinAnsiEncoding  % The base encoding
  % The changes. In this case, it's a subset font with the characters at position 1 onward.
  /Differences
  [1 /w /i /d /g /e /t /s /T /h /space /r /u /l /a /x /bracketleft
   /underscore /J /o /n /S /m /quotesingle /A /p /c /bracketright /one
   /colon /braceleft /b /k /braceright /v /period /parenleft /two
   /parenright /asterisk /y /P]
>>
endobj
8 0 obj                           % The font descriptor, giving the remaining metrics.
<<
  /Type /FontDescriptor
  /FontName /GCCBBY+TT8Et00
  /FontBBox [0 -205 602 770]
  /Flags 4
  /Ascent 770
  /CapHeight 770
  /Descent -205
  /ItalicAngle 0
  /StemV 90
  /MissingWidth 602
  /FontFile2 12 0 R               % The actual font file, here in TrueType format.
>>
endobj
此处不讨论实际字体格式(Type1、TrueType 等)的细节——事实上,它们也不在 PDF 标准中讨论,而是通过这些字体格式提供商的外部文档进行讨论。
The details of the actual font formats (Type1, TrueType etc.) are not discussed here—in fact, they are not discussed in the PDF Standard either, but by external documents from the providers of those font formats.
通常在文件的字体字典中包含足够的信息以允许检索实际的字符标识(而不仅仅是字形)。这对于允许用户从 Adobe Reader 等 PDF 查看应用程序中搜索和复制文本非常重要。在更有限的能力下,也可以使用 in 来允许对文档的文本内容进行编辑。
It is customary to include enough information in a file’s font dictionaries to allow the actual character identities (rather than just the glyphs) to be retrieved. This is important to allow users to search and copy text from PDF viewing applications like Adobe Reader. It can also be used, in a more limited capacity, to allow edits to be made to the textual content of a document.
有两种机制:/Encoding字体中的条目(将字符代码映射到 Adobe Glyph List 条目,/bullet如
/ToUnicode实体。下面是一个/ToUnicode程序示例:
There are two mechanisms for this: the /Encoding entry in the font (which maps
character codes to Adobe Glyph List entries like /bullet), and a more modern mechanism, the
/ToUnicode entry which provides a
program in a language defined by Adobe which maps character codes directly
to Unicode entities. Here is an example of a /ToUnicode program:
23 0 obj
<< /Length 317 >>
stream
/CIDInit /ProcSet findresource begin 12 dict begin begincmap /CIDSystemInfo <<
/Registry (Symbol+0) /Ordering (T1UV) /Supplement 0 >> def
/CMapName /Symbol+0 def
1 begincodespacerange <01> <01> endcodespacerange
1 beginbfrange
<01> <01> <2022>    % Maps character code 1 to Unicode U+2022, the bullet point
endbfrange
endcmap CMapName currentdict /CMap defineresource pop end end
endstream
endobj
文本提取的另一个困难是在内容流中重建文本运算符。操作员可能会拆分文本以进行字距调整或对齐,行尾的连字符可能会打断字符流。实际上,文本运算符甚至有可能乱序。不过,通常情况下,可以从大多数现代文件中生成良好的文本重建。
Another difficulty in the extraction of text is reconstructing the text from the text operators within the content stream. Operators may split up the text for kerning or justification, and hyphenation at the end of lines can interrupt the stream of characters. Indeed, it is even possible that the text operators may be out of order. Usually, though, a good reconstruction of text may be produced from most modern files.
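Returning to the /ToUnicode program, the mapping it holds can be read with very little code. The Python sketch below handles only the simplest form of a bfrange entry, three hexadecimal strings on one line mapping a range of codes to consecutive Unicode code points. The function name is our own, and real programs also need to handle the other CMap constructs that this sketch ignores.

```python
import re

# One bfrange line of the form <low> <high> <first destination>.
BFRANGE = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfranges(cmap_text):
    """Return {character code: Unicode string} for simple bfrange entries."""
    mapping = {}
    in_block = False
    for line in cmap_text.splitlines():
        stripped = line.strip()
        if stripped.endswith("beginbfrange"):
            in_block = True
        elif stripped.startswith("endbfrange"):
            in_block = False
        elif in_block:
            match = BFRANGE.match(stripped)
            if match:
                low, high, dest = (int(g, 16) for g in match.groups())
                for offset, code in enumerate(range(low, high + 1)):
                    mapping[code] = chr(dest + offset)
    return mapping


sample = """1 beginbfrange
<01> <01> <2022>
endbfrange"""
print(parse_bfranges(sample) == {1: "\u2022"})  # True
```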
除了 PDF 标准外,还有许多其他文档提供了本章所讨论主题的更多详细信息:
As well as the PDF Standard, there are a number of other documents which provide further detail on the topics discussed in this chapter:
Unicode 在 The Unicode Consortium 发布的 The Unicode Standard, Version 5.0 中有完整的描述。更容易理解的介绍是 O'Reilly 自己的Unicode Explained by Jukka K. Korpela。
Unicode is described fully in The Unicode Standard, Version 5.0, published by The Unicode Consortium. A more digestible introduction is O’Reilly’s own Unicode Explained by Jukka K. Korpela.
Yannis Haralambous (O'Reilly) 的 Fonts and Encodings 解释了 PDF 使用的各种字体格式。
Fonts and Encodings by Yannis Haralambous (O’Reilly) explains the various font formats used by PDF.
Adobe 字体和字体技术中心收集了各种字体格式和编码系统的历史和当前文档,包括用于编码外语的前 Unicode 方法。
The Adobe Font and Type Technology Center is a collection of historic and current documents for the various font formats and encoding systems, including pre-Unicode methods for encoding foreign languages.
在本章中,我们讨论了四个主题,这些主题与 PDF 文档的视觉外观无关,但与文档的交互、屏幕使用可能包含的辅助数据以及用于携带文档的额外信息以供使用的元数据有关通过 PDF 工作流程中的程序。
In this chapter, we discuss four topics related not to the visual appearance of a PDF document, but to the ancillary data which may also be included for interactive, onscreen use of documents, and the metadata used to carry extra information with a document for use by programs in a PDF workflow.
定义文件中位置的数据结构。它们可用于指定书签或 超链接指向的位置。书签(恰当地称为文档大纲)用作文档的目录。
Destinations and bookmarks: destinations are data structures defining a position within a file. They can be used to specify where a bookmark or hyperlink points to. Bookmarks (properly called the document outline) are used as a table of contents for the document.
包含指定格式的 XML 文件的流,包含一些与文档信息字典相同的元数据,以及其他字段。
XML metadata: a stream containing an XML file in a specified format, containing some of the same metadata as the document information dictionary, together with additional fields.
允许将整个文件封装在文档中,就像电子邮件附件一样。
File attachments: these allow whole files to be encapsulated in a document, much like an email attachment.
允许将文本和图形应用于 PDF 页面的顶部,与主页内容分开,以供屏幕阅读器显示。一种特殊类型的注释是 超链接,它允许用户单击页面上的某处并被重定向到文件中其他地方的目标。
Annotations: these allow text and graphics to be applied on top of a PDF page, separate from the main page content, for display by onscreen readers. One particular kind of annotation is the hyperlink, which allows a user to click somewhere on a page and be redirected to a destination elsewhere in the file.
文档的书签(恰当地称为 文档大纲)是一个条目树(通常是章节、节、段落等的标题),可以在 PDF 查看器中单击它们以在文档中移动。每个条目都有一些文本和 描述其链接位置的目的地。
A document’s bookmarks (properly called the document outline) are a tree of entries (typically titles of chapters, sections, paragraphs etc.) which can be clicked on in a PDF viewer to move around the document. Each entry has some text and a destination describing where it links to.
目标定义了 PDF 文件中的位置,包括页码、页面内的位置以及查看该页面时使用的放大率。目的地可以明确定义(为简单起见,我们将这样做)或通过名称引用并在 列出所有目的地的文档范围名称树中查找。书签通常显示在 PDF 查看器中的文档旁边。
A destination defines a place in a PDF file, consisting of the page number, position within that page, and magnification to use when viewing that page. Destinations may be defined explicitly (as we will do for simplicity) or referenced by a name and looked up in a document-wide name tree listing all destinations. The bookmarks are typically displayed alongside the document in a PDF viewer.
目的地是使用数组对象定义的,其内容取决于目的地的类型。表 7-1总结了目标句法。
Destinations are defined using an array object, with the contents depending upon the kind of destination. Destination syntax is summarized in Table 7-1.
表 7-1。目的地的语法。“页面”是对页面对象的间接引用。除非另有说明,否则目的地使用裁剪框(如果没有裁剪框,则使用媒体框)。
Table 7-1. Syntax for destinations. “page” is an indirect reference to a page object. Destinations use the crop box (or media box if there is no crop box) unless otherwise specified.
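| Array | Description |
|---|---|
| [page /Fit] | Display the page with the horizontal and vertical magnification set to just fit the whole page in the window. |
| [page /FitH top] | Display the page with the vertical coordinate top at the top edge of the window, and the magnification set to fit the document horizontally. |
| [page /FitV left] | Display the page with the horizontal coordinate left at the left edge of the window, and the magnification set to fit the document vertically. |
| [page /XYZ left top zoom] | Display the page with (left, top) at the upper-left corner of the window, and the page magnified by the factor zoom. A null value for any parameter indicates no change. |
| [page /FitR left bottom right top] | Display the page zoomed to show the rectangle specified by left, bottom, right, and top. |
| [page /FitB] | As for /Fit, but with the bounding box of the page's contents rather than the crop box. |
| [page /FitBH top] | As for /FitH, but with the bounding box of the page's contents rather than the crop box. |
| [page /FitBV left] | As for /FitV, but with the bounding box of the page's contents rather than the crop box. |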
文档大纲由大纲字典和许多
大纲项字典定义的大纲条目树组成。/Outlines大纲词典由文档目录中的条目指向。条目的子条目(子项)可以默认显示 ( open ) 或默认隐藏并且仅通过单击 ( closed ) 显示。大纲词典总结在表7-2和7-3中。
The document outline consists of a tree of outline entries defined
by an outline dictionary and a number of
outline item dictionaries. The outline dictionary
is pointed to by the /Outlines entry
in the document catalog. The subentries (children) for an entry may be
shown by default (open) or concealed by default
and only revealed by clicking (closed). The
outline dictionaries are summarized in Tables 7-2 and 7-3.
表 7-2。大纲词典中的条目
Table 7-2. Entries in an outline dictionary
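| Key | Value type | Value |
|---|---|---|
| /Type | name | If present, must be /Outlines. |
| /First | indirect reference to dictionary | The outline item dictionary for the first top-level item in the document outline. Required if there are any document outline entries. |
| /Last | indirect reference to dictionary | The outline item dictionary for the last top-level item in the document outline. Required if there are any document outline entries. |
| /Count | integer | The total number of open outline entries in all parts of the outline. May be omitted if there are no open entries. |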
表 7-3。大纲项目字典中的条目 * 表示必填项
Table 7-3. Entries in an outline item dictionary *denotes a required entry
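| Key | Value type | Value |
|---|---|---|
| /Title* | text string | The text to be displayed for this entry. |
| /Parent* | indirect reference to dictionary | A pointer to the parent of this item in the outline tree: either another outline item dictionary or the top-level outline dictionary. |
| /Prev | indirect reference to dictionary | A pointer to the previous item at this level, if there is one. |
| /Next | indirect reference to dictionary | A pointer to the next item at this level, if there is one. |
| /First | indirect reference to dictionary | A pointer to the first child of this entry, if it has any. |
| /Last | indirect reference to dictionary | A pointer to the last child of this entry, if it has any. |
| /Count | integer | If this entry is open, the number of open entries below it. If it is closed, a negative integer whose absolute value is the number of descendants which would be shown if the user opened this item. |
| /Dest | name, string, or array | The destination. An array is an explicit destination, a name is a reference to an entry in the /Dests entry of the document catalog, and a string is a reference to an entry in the /Dests entry of the document's name dictionary. |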
考虑一个包含三页的文件。我们希望建立以下层次结构:
Consider a file with three pages. We wish to build the following hierarchy:
Part 1 (指向第一页)
    Part 1A (指向第二页)
    Part 1B (指向第三页)
Part 1 (points to page one)
    Part 1A (points to page two)
    Part 1B (points to page three)
生成的代码如示例 7-1所示。本文档中的页面对象的第一页、第二页和第三页的对象编号分别为 3、5 和 7。对象 12 是文档目录。对象11是文档大纲字典,对象8、9、10是文档大纲项字典。
The resultant code is shown in Example 7-1. The page objects in this document have object numbers 3, 5, and 7 for pages one, two and three respectively. Object 12 is the document catalog. Object 11 is the document outline dictionary, and objects 8, 9, and 10 are document outline item dictionaries.
示例 7-1。示例文档大纲
Example 7-1. An example document outline
8 0 obj
<< /Parent 10 0 R /Title (Part 1B) /Dest [ 7 0 R /Fit ] /Prev 9 0 R >>
endobj
9 0 obj
<< /Parent 10 0 R /Title (Part 1A) /Dest [ 5 0 R /Fit ] /Next 8 0 R >>
endobj
10 0 obj
<< /Parent 11 0 R /First 9 0 R /Dest [ 3 0 R /Fit ] /Title (Part 1) /Last 8 0 R >>
endobj
11 0 obj
<< /First 10 0 R /Last 10 0 R >>
endobj
12 0 obj
<< /Outlines 11 0 R /Pages 1 0 R /Type /Catalog >>
Adobe Reader 显示文档及其大纲, 如图 7-1所示。
Adobe Reader displays the document and its outline as shown in Figure 7-1.
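Example 7-1 leaves out the /Count entries. As a sketch of how the value described in Table 7-3 might be computed, the Python below counts the descendants that are visible beneath an item. The Item class and function names are our own, and the rule for nested open and closed items reflects our reading of the table rather than a tested implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Item:
    title: str
    is_open: bool = True
    children: List["Item"] = field(default_factory=list)


def visible_descendants(item: Item) -> int:
    """Descendants that would be shown if this item were open."""
    total = 0
    for child in item.children:
        total += 1                      # the child itself is shown
        if child.is_open:
            total += visible_descendants(child)
    return total


def count_entry(item: Item) -> Optional[int]:
    """Value for /Count, or None when the entry can be omitted."""
    n = visible_descendants(item)
    if n == 0:
        return None
    return n if item.is_open else -n


part1 = Item("Part 1", True, [Item("Part 1A"), Item("Part 1B")])
print(count_entry(part1))  # 2
```

With Part 1 closed instead, the same call would give -2, telling the viewer that two entries appear when the user opens it.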
从 PDF 1.4 开始,元数据流可用于将 XML 元数据附加到整个文档或其中的单个元素。文档级元数据流扩展并取代了文档信息字典(为了与旧的 PDF 程序兼容,它几乎总是包含在内)。
Starting with PDF 1.4, metadata streams can be used to attach XML metadata to the whole document, or to individual elements within it. Document level metadata streams extend and supersede the document information dictionary (which is almost always included for compatibility with older PDF programs).
元数据以未压缩和(通常)未加密的方式存储,以这样一种方式,不了解 PDF 的外部工具可以轻松地在 PDF 文件中找到它。
The metadata is stored uncompressed and (typically) unencrypted, and in such a way that external tools which don’t know about PDF can find it within a PDF file easily.
XML 使用由 Adobe 的XMP:可扩展元数据平台中描述的可扩展元数据平台 (XMP) 定义的标记。这种格式包括一种以独立于平台的方式将元数据嵌入其他格式(例如 PDF)的方法,以便无法理解封闭格式的程序仍然可以提取 XMP 数据。XMP 格式的完整详细信息位于Adobe 的网站上。
The XML uses markup defined by the Extensible Metadata Platform (XMP) which is described in Adobe’s XMP: Extensible Metadata Platform. This format includes a method of embedding the metadata in other formats (e.g., PDF) in a platform-independent way so that programs which cannot understand the enclosing format can still extract the XMP data. Full details of the XMP Format are on Adobe’s website.
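Because the packet is stored uncompressed, a program can find it by scanning the raw bytes of the file without parsing any PDF structure. The Python sketch below illustrates the idea; the function name is our own, and it relies only on the xpacket begin and end markers shown in the example that follows.

```python
def find_xmp_packets(data: bytes):
    """Return every XMP packet found in a byte string, markers included."""
    packets = []
    position = 0
    while True:
        start = data.find(b"<?xpacket begin=", position)
        if start == -1:
            break
        end_marker = data.find(b"<?xpacket end=", start)
        if end_marker == -1:
            break
        end = data.find(b"?>", end_marker)
        if end == -1:
            break
        packets.append(data[start:end + 2])   # include the closing "?>"
        position = end + 2
    return packets


sample = (b"%PDF-1.4\n1 0 obj<</Type/Metadata/Subtype/XML>>stream\n"
          b"<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>"
          b"<x:xmpmeta xmlns:x='adobe:ns:meta/'/>"
          b"<?xpacket end='w'?>\nendstream\nendobj\n")
print(len(find_xmp_packets(sample)))  # 1
```

This approach fails if the metadata stream has been compressed or encrypted, which is why it is normally stored in the clear.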
示例 XMP 元数据如示例 7-2所示。您可以从文档信息词典中看到一些熟悉的条目。/Type
/Metadata /Subtype /XML另请注意将此流标识为 XMP 元数据的序列。通过使用/Metadata文档目录中的条目将元数据流添加到文档中。
Example XMP metadata is shown in Example 7-2.
You can see some of the familiar entries from the document information
dictionary. Note also the sequence /Type
/Metadata /Subtype /XML which identifies this stream as XMP
metadata. A metadata stream is added to a document by using the /Metadata entry in the document catalog.
示例 7-2。ISO PDF 格式参考手册 PDF 的 XML 元数据。↵ 符号用于指示无需回车即可继续的行。␣ 符号用于表示空格字符。
Example 7-2. XML Metadata for the ISO PDF Format reference manual PDF. The ↵ symbol is used to indicate a line which continues without a carriage return. The ␣ symbol is used to represent a space character.
4884␣0␣obj<</Length␣3508/Type/Metadata/Subtype/XML>>stream
<?xpacket␣begin=''␣id='W5M0MpCehiHzreSzNTczkc9d'?>
<?adobe-xap-filters␣esc="CRLF"?>
<x:xmpmeta␣xmlns:x='adobe:ns:meta/'␣x:xmptk='XMP␣toolkit␣2.9.1-14,␣framework␣1.6'>
<rdf:RDF␣xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'↵
xmlns:iX='http://ns.adobe.com/iX/1.0/'>
<rdf:Description␣rdf:about='uuid:b8659d3a-369e-11d9-b951-000393c97fd8'↵
␣xmlns:pdf='http://ns.adobe.com/pdf/1.3/'↵
␣pdf:Producer='Acrobat␣Distiller␣6.0.1␣for␣Macintosh'>↵
</rdf:Description>
<rdf:Description␣rdf:about='uuid:b8659d3a-369e-11d9-b951-000393c97fd8'↵
␣xmlns:xap='http://ns.adobe.com/xap/1.0/'↵
␣xap:CreateDate='2004-11-14T08:41:16Z'↵
␣xap:ModifyDate='2004-11-14T16:38:50-08:00'↵
␣xap:CreatorTool='FrameMaker␣7.0'↵
␣xap:MetadataDate='2004-11-14T16:38:50-08:00'>↵
</rdf:Description>
<rdf:Description␣rdf:about='uuid:b8659d3a-369e-11d9-b951-000393c97fd8'↵
␣xmlns:xapMM='http://ns.adobe.com/xap/1.0/mm/'↵
␣xapMM:DocumentID='uuid:919b9378-369c-11d9-a2b5-000393c97fd8'/>
<rdf:Description␣rdf:about='uuid:b8659d3a-369e-11d9-b951-000393c97fd8'↵
␣xmlns:dc='http://purl.org/dc/elements/1.1/'↵
␣dc:format='application/pdf'>↵
<dc:description><rdf:Alt>↵
<rdf:li␣xml:lang='x-default'>␣Adobe␣Portable␣Document␣Format␣(PDF)␣</rdf:li>↵
</rdf:Alt></dc:description>↵
<dc:creator>␣<rdf:Seq>␣<rdf:li>↵
Adobe␣Systems␣Incorporated␣</rdf:li>␣</rdf:Seq>␣</dc:creator>↵
<dc:title>␣<rdf:Alt>↵
<rdf:li␣xml:lang='x-default'>PDF␣Reference,␣version␣1.6␣</rdf:li>␣</rdf:Alt>↵
</dc:title></rdf:Description>↵
</rdf:RDF>
</x:xmpmeta>
␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣
(Many more lines of padding)
<?xpacket␣end='w'?>
endstream
endobj
注释在 PDF 中用于在页面内容本身之外添加注释或交互元素。每个查看器应用程序(例如 Adobe Reader 或 Mac OS X Preview)可能会以不同的方式显示这些注释,甚至在软件版本之间也会发生变化,因此无法依赖确切的视觉效果。注释不影响打印输出。
Annotations are used in PDF to add comments or interactive elements outside of the page content itself. Each viewer application (for example Adobe Reader or Mac OS X Preview) may display these annotations in a different way, even changing between software versions, so the exact visual effect cannot be relied upon. The annotations do not affect the printed output.
/Annots可以使用页面字典中条目下的数组将一个或多个注释与每个页面相关联。每个注解都是一本字典。表 7-4中描述了更重要的条目。每种类型的注释在该字典中都有附加条目。
One or more annotations may be associated with each page using an
array under the entry /Annots in the
page dictionary. Each annotation is a dictionary. The more important
entries are described in Table 7-4. Each
type of annotation has additional entries in this dictionary.
表 7-4。注释字典中的条目(*表示必需的条目)
Table 7-4. Entries in an annotation dictionary (*denotes required entry)
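| Key | Value type | Value |
|---|---|---|
| /Type | name | If present, must be /Annot. |
| /Subtype* | name | The type of this annotation. |
| /Rect* | rectangle | The position and size of the annotation, in default user space units. |
| /Contents | text string | The textual content of this annotation or, if there is none, an alternative human-readable description. |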
我们将了解两种注释:可用于添加注释 的文本注释和用于在文档中创建超链接的链接注释。还有许多其他类型可用于在文档上绘图、突出显示文本和添加打印机标记。在File Attachments中,我们使用文件附件注释将附件添加到各个页面。
We’ll look at two kinds of annotations: text annotations which can be used to add comments, and link annotations which are used to make hyperlinks within a document. There are many other types for drawing on the document, highlighting text and adding printer’s marks. In File Attachments, we use file attachment annotations to add attachments to individual pages.
首先,文本注释。在这里,/Subtype是/Text。代码如例 7-3所示。我们将额外的注释字典条目
/Open设置true为指示打开文档时注释将可见。/C条目的背景颜色设置为白色
。
First, a text annotation. Here, the /Subtype is /Text. The code is shown in Example 7-3. We set the extra annotation dictionary entry
/Open to true to indicate the note will be visible when
the document is opened. The background color is set to White with the
/C entry.
示例 7-3。文本注释
Example 7-3. A Text annotation
6 0 obj
<<
  /Subtype /Text
  /Open true
  /Contents (An example text annotation)
  /Type /Annot
  /Rect [400 100 500 200]
  /C [1 1 1]               % RGB (1, 1, 1) i.e., White
>>
/Annots [6 0 R]            % Extra entry in page dictionary
Adobe Reader 中的结果如图 7-2所示。请注意,Adobe Reader 会忽略
/Rect此处的条目——其他查看者可能会使用它。
The result in Adobe Reader is shown in Figure 7-2. Note that Adobe Reader ignores the
/Rect entry here—other viewers may use
it.
现在,让我们尝试链接注释,以构建从第一页到第三页的超链接。链接注释具有子类型/Link和/Dest给出目的地的条目(在
Destinations中描述)。该/Rect条目定义超链接的区域。
Now, let’s try a link annotation, to build a hyperlink from page one
to page three. A link annotation has subtype /Link and a /Dest entry giving the destination (described in
Destinations). The /Rect entry defines the area of the
hyperlink.
代码如例 7-4所示。
The code is shown in Example 7-4.
示例 7-4。链接注释
Example 7-4. A link annotation
6 0 obj
<<
/Subtype /Link
/Dest [4 0 R /Fit]
/Type /Annot
/Rect [45 760 260 800]
>>
/Annots [6 0 R]    % Extra entry in page dictionary
Adobe Reader 中的结果如图 7-3所示。
The result in Adobe Reader is shown in Figure 7-3.
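A viewer decides whether a click activates a link by testing the click position against the annotation's /Rect. A minimal Python sketch of that test follows; the function name is our own, and it assumes the click has already been converted into default user space coordinates.

```python
def in_rect(rect, x, y):
    """True if the point (x, y) lies inside a PDF rectangle [x0 y0 x1 y1]."""
    x0, y0, x1, y1 = rect
    # The two corners may be given in either order, so sort them first.
    left, right = sorted((x0, x1))
    bottom, top = sorted((y0, y1))
    return left <= x <= right and bottom <= y <= top


link_rect = [45, 760, 260, 800]
print(in_rect(link_rect, 100, 780))  # True
print(in_rect(link_rect, 300, 780))  # False
```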
附件是一种在 PDF 文档中包含一个或多个文件(任何类型)的方式。文件可以作为一个整体或单独的页面附加到文档中。通常,PDF 查看器将显示所有附件的列表,允许用户打开或保存它们。例如,此工具可用于将示例资源与幻灯片演示文稿的 PDF 捆绑在一起。
An attachment is a way of including one or more files (of any type) within a PDF document. Files may be attached to the document as a whole, or to individual pages. Typically, the PDF viewer will display a list of any attachments, allowing the user to open or save them. This facility could be used, for example, to bundle example resources along with a PDF of a slide-show presentation.
嵌入文件本身简单地包含在流对象中,
/Type /EmbeddedFile作为流字典中的附加条目。示例嵌入文件的代码如示例 7-5所示。
The embedded file itself is simply included in a stream object, with
/Type /EmbeddedFile as an additional
entry in the stream dictionary. The code for a sample embedded file is
shown in Example 7-5.
示例 7-5。嵌入文件
Example 7-5. An embedded file
8 0 obj
<< /Type /EmbeddedFile /Length 35 >>
stream
This is a text file attachment...
endstream
endobj
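When writing such an object programmatically, the main detail to get right is that /Length matches the number of bytes in the stream data. The Python sketch below builds an embedded file object from a payload; the function name is our own, and it writes the data uncompressed with simple newline separators.

```python
def embedded_file_object(object_number: int, payload: bytes) -> bytes:
    """Serialize an uncompressed embedded file stream as a PDF object."""
    header = b"%d 0 obj\n<< /Type /EmbeddedFile /Length %d >>\nstream\n" % (
        object_number, len(payload))
    # The newline before endstream is not counted in /Length.
    return header + payload + b"\nendstream\nendobj\n"


data = embedded_file_object(8, b"This is a text file attachment...")
print(data.decode("ascii"))
```

The object number still has to be recorded in the cross-reference table, and the stream referenced from a file specification, before a viewer will list the attachment.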
嵌入式文件流以两种截然不同的方式被引用:一种用于整个文档的附件,另一种用于特定页面的附件。
The embedded file stream is referenced in two quite different ways: one for attachments to the whole document, another for attachments to particular pages.
为了附加到整个文档,一个条目包含在
文档目录中条目/EmbeddedFiles引用的名称字典中。/Names代码如例 7-6所示。
To attach to the whole document, an /EmbeddedFiles entry is included in the name
dictionary referenced by the /Names
entry in the document catalog. The code is shown in Example 7-6.
示例 7-6。文档级别附件的 PDF 代码。嵌入文件是对象 8(参见示例 7-5)。
Example 7-6. PDF Code for an attachment at the document level. The embedded file is object 8 (see Example 7-5).
9 0 obj
<< /Names
<< /EmbeddedFiles
<< /Names
[ (attachment.txt) << /EF << /F 8 0 R >> /F (attachment.txt) /Type /Filespec >> ] >>
>>
/Pages 1 0 R
/Type /Catalog >>
endobj
To attach to a single page, a file attachment annotation is used,
listed as usual in the /Annots array in the page dictionary.
The code is shown in Example 7-7.
Example 7-7. PDF code for an attachment to a particular page. The embedded file is object 8 (see Example 7-5).
9 0 obj
<<
/Type /Page
(Other dictionary entries as usual)
/Annots
[ << /FS << /EF << /F 8 0 R >> /F (attachment.txt) /Type /Filespec >>
/Subtype /FileAttachment
/Contents (attachment.txt)
/Rect [ 18 796.88976378 45 823.88976378 ]
>> ]
>>
endobj
Adobe Reader’s display of the attachment in a sidebar is shown in Figure 7-4.
PDF documents can be encrypted using a variety of industry-standard schemes which have increased in complexity and security over the years, starting with PDF version 1.1. The PDF standard provides, in addition, a general mechanism for encapsulating third-party encryption and security policies.
Encryption applies, with a few exceptions, to streams and strings in the file, but does not encrypt numbers or other PDF data types, nor does it encrypt the file as a whole. Thus, the document’s object structure remains visible to applications without the need for decryption, but the substantive content of the document is safeguarded.
The more modern PDF encryption methods allow the file’s XMP metadata stream (XML Metadata) to be left unencrypted so it may be extracted and read by programs which don’t know how to open encrypted PDF files, or if the password is not known.
Due to the complexity of encrypted documents, it isn’t possible to manually build an example (as we have in other chapters), but we can use pdftk to process our standard hello.pdf file into an encrypted one, encrypted.pdf:
pdftk hello.pdf output encrypted.pdf encrypt_40bit owner_pw fred
This creates the output file encrypted.pdf using the 40-bit RC4 method with
an owner password of “fred”. The owner
password is the master password for the file. Someone who has
it can do anything with the file, including re-encrypting it or changing
the security settings. The user password allows the
user to perform certain actions (view the document, print the document
etc.) defined by the owner when the file was encrypted.
In our example, we’re using a blank user password, which is very common. This means the file opens right away in a PDF viewer, without any password being entered. We’ve banned the user from doing anything other than viewing the file (see Encryption and Decryption for details of the pdftk syntax for permissions and different encryption types).
When the file is opened in Adobe Reader, the only noticeable change
is that (SECURED) is appended to the
window’s title bar. By opening the File...Properties window, and choosing the
Security tab, the security properties
can be viewed—see Figure 8-1. A more
technically-minded display is obtained by clicking on the Show Details... button to bring up the window
shown in Figure 8-2.
If using a program which can edit PDF files, such as Adobe Acrobat, the user will be prompted for the owner password upon attempting any editing operation not allowed by the permissions, as shown in Figure 8-3.
A similar dialog is presented upon opening the file if the document has a non-blank user password, as shown in Figure 8-4. If the password is not known, the file cannot be opened, even for viewing.
Example 8-1 shows the content of our new file. See if you can spot the differences from the standard hello.pdf file in Example 2-2.
Example 8-1. An encrypted file
%PDF-1.1
%âãÏÓ
1 0 obj
<< /Kids [2 0 R] /Type /Pages /Count 1 >>
endobj
3 0 obj
<< /Length 72 >>
stream
(72 bytes of encrypted data)
endstream
endobj
2 0 obj
<< /Rotate 0 /Parent 1 0 R /Resources << /Font << /F0 << /BaseFont /Times-Italic /Subtype /Type1 /Type /Font >> >> >> /MediaBox [0.000000 0.000000 595.275590551 841.88976378] /Type /Page /Contents [3 0 R] >>
endobj
4 0 obj
<< /Pages 1 0 R /Type /Catalog >>
endobj
5 0 obj    % The encryption dictionary
<< /R 2 /P -64 /O (ífff÷ÚÉMº]Òq)ȢϺA»fgygy^ÏynÔZ¾gtëÙ) /Filter /Standard /V 1 /U (gdË^Wîg:lÆr({M8®qµG9Tæ$YTscåGùLÂÐþ¬) >>
endobj
xref
0 6
0000000000 65535 f
0000000015 00000 n
0000000199 00000 n
0000000074 00000 n
0000000427 00000 n
0000000478 00000 n
trailer
<< /Encrypt 5 0 R    % Reference to encryption dictionary at object 5
/Root 4 0 R /Size 6 /ID [<a7d625071f5b223d97922e9e6c3fff23><e546c20487a77c4156083bf56f69bb4d>] >>
startxref
617
%%EOF
Look again at Example 8-1. An encryption
dictionary has been included (object 5) and referenced by the /Encrypt entry in the trailer dictionary. This
encryption dictionary contains, in this instance:
The /R and /V entries which, together, define which
encryption algorithms are to be used.
The /P entry, which is a
bitfield indicating the permissions (view, print etc.) which are
attached to the use of the user password.
The /O and /U entries which are used to verify the
owner and user passwords respectively.
The /Filter entry which is
/Standard for Adobe security
methods.
Standard encryption methods provided are:
| 40-bit RC4 (PDF 1.1) |
| 128-bit RC4 (PDF 1.4) |
| 128-bit AES (PDF 1.6) |
| 256-bit AES (PDF 1.7 ExtensionLevel 3) |
The permissions bitfield for 40-bit RC4 (the first method to be
introduced) allows for a /P entry
allowing a combination of printing, modification of the document,
extraction of text and graphics, and annotation. The 128-bit RC4 and later
methods allow more permission options.
The permissions are described in prose by the ISO standard and so the consistency of their implementation by different PDF processing programs cannot be relied upon.
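To make the /P bitfield concrete, here is a small Python sketch that decodes the value -64 from Example 8-1. The bit positions (3 for printing, 4 for modification, 5 for extraction, 6 for annotation, counting from 1 at the least significant bit) are taken from the ISO specification's table for the standard security handler rather than from the text above, and the names PERMISSION_BITS and decode_permissions are our own.

```python
# Bit positions (1 = least significant bit) for the four permissions
# available with 40-bit RC4. Assumed from the ISO 32000-1 permissions table.
PERMISSION_BITS = {
    "print": 3,
    "modify": 4,
    "extract": 5,
    "annotate": 6,
}


def decode_permissions(p: int) -> dict:
    """Return which operations a /P value allows (hypothetical helper)."""
    flags = p & 0xFFFFFFFF  # view the signed /P value as 32 unsigned bits
    return {
        name: bool(flags & (1 << (bit - 1)))
        for name, bit in PERMISSION_BITS.items()
    }


print(decode_permissions(-64))  # every permission is False
print(decode_permissions(-4))   # every permission is True
```

The value -64 has all of its low six bits clear, which matches our pdftk example: the user may view the file but nothing else.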
Any encrypted file may be read as usual, and parsed into an object
graph, without regard to its encryption. We can then inspect it for
encryption by checking for the existence of an /Encrypt entry in the trailer dictionary. Then,
we try to decrypt the file using the blank user password:
The contents of the encryption dictionary are read, and the encryption type determined.
The user password is authenticated (it is processed using a
one-way algorithm, and compared with the /U entry in the encryption
dictionary).
Using a further algorithm, an encryption key is calculated.
This key is used to decrypt each stream and string in the file. This can be done all at once or, more efficiently, only when an object is actually needed.
The permissions are read, and enforced in any further operations done on the file.
The actual algorithm used for each step depends upon the kind of encryption in use. The same process is used if the user password is non-blank, using the password entered by the user instead.
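As an illustration of the authentication and key-calculation steps, the sketch below follows our reading of the ISO specification's algorithms for the oldest method only (40-bit RC4, /R 2). It is a simplified sketch under stated assumptions, not a complete security handler: the function names are our own, the /O value in the demonstration is a placeholder rather than a real entry, and the per-object keys needed to decrypt individual strings and streams are not covered.

```python
import hashlib
import struct

# 32-byte padding string defined by the standard security handler.
PAD = bytes.fromhex(
    "28BF4E5E4E758A4164004E56FFFA0108"
    "2E2E00B6D0683E802F0CA9FE6453697A"
)


def rc4(key: bytes, data: bytes) -> bytes:
    """Plain RC4: the same function encrypts and decrypts."""
    s = list(range(256))
    j = 0
    for i in range(256):  # key scheduling
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    out = bytearray()
    i = j = 0
    for byte in data:  # keystream generation
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)


def file_key_r2(password: bytes, o_entry: bytes, p: int, first_id: bytes) -> bytes:
    """Derive the 5-byte file key for /R 2 from the user password."""
    padded = (password + PAD)[:32]  # pad or truncate to 32 bytes
    p_bytes = struct.pack("<I", p & 0xFFFFFFFF)  # /P as 4 bytes, low byte first
    digest = hashlib.md5(padded + o_entry + p_bytes + first_id).digest()
    return digest[:5]


def user_password_ok(password, o_entry, u_entry, p, first_id) -> bool:
    """Recompute /U from the password and compare with the stored entry."""
    key = file_key_r2(password, o_entry, p, first_id)
    return rc4(key, PAD) == u_entry


# Demonstration with placeholder data (not the real /O from Example 8-1).
o_entry = bytes(32)
first_id = bytes.fromhex("a7d625071f5b223d97922e9e6c3fff23")
u_entry = rc4(file_key_r2(b"", o_entry, -64, first_id), PAD)
print(user_password_ok(b"", o_entry, u_entry, -64, first_id))  # True
```

Later revisions use longer keys, extra hashing rounds, or AES, so this sketch should not be applied to files whose /R entry is anything other than 2.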
To decrypt using the owner password, a similar process is followed, except that the permissions need not be applied. If the file is opened with the user password and the owner password is entered later, the permissions may be relaxed.
To write a parsed PDF to a file with encryption:
The /U and /O entries are calculated based on a one-way
algorithm combining the owner and user passwords.
The rest of the entries in the encryption dictionary are built, including the permissions, and the encryption dictionary is added to the trailer dictionary.
Each string and stream in the file is encrypted using a key calculated from the encryption dictionary.
The PDF object graph is flattened to a file in the usual fashion.
Again, the actual algorithms involved at each stage vary with the encryption method in use.
If the permissions on a file allow it to be edited with just the user password, we must be able to write the modified file, still encrypted with the same owner and user password. However, the algorithms given above would require the owner password to be known to encrypt the file again for writing.
To solve this problem, the encryption parameters from the original
reading of the file are retained, even though the encryption dictionary
itself must be removed once the file is decrypted. The encryption
dictionary (including the /O and
/U entries) may therefore be
reconstructed.
Pdftk is a multiplatform command-line tool built on the iText library (which is described in iText for Java and C#). It has facilities for merging, splitting, and stamping documents, and for setting and reading metadata.
Pdftk has a somewhat unusual command-line interface, where elements often have to appear in a particular order. We can split them into four groups, in the order they are specified:
The input file or files, and possible input passwords.
The operation and any arguments it requires.
The output and any output passwords and permissions.
Sundry output and other options.
The full details can be found in the manual for pdftk—in this chapter, we give only the subset needed for our examples.
To merge documents, we use the cat operation. This is the default operation, so
we don’t actually need to specify the cat keyword. For example, to merge the pages of
three files into one, in order, we need:
pdftk file1.pdf file2.pdf file3.pdf output output.pdf
This writes a new file to output.pdf containing all the pages of file1.pdf, file2.pdf, and file3.pdf, in order. The output file must not be the same as any of the input files.
Pdftk allows us to choose which pages are taken from each document, and what the viewing rotation of each output page is. Such page ranges are used by listing them in order after the inputs. For example:
pdftk file1.pdf file2.pdf 1-5 even output out.pdf
takes pages one to five inclusive from file1.pdf and pages two, four, six… from file2.pdf.
To include pages from a file at two or more distinct points in the
output, we can associate handles with each file by
writing, for example A=input.pdf, and
refer to those handles when giving the page ranges.
A1 A B: The first page of document A (duplicated as a cover page), then the whole of documents A and B.
A4-50oddD: Odd pages of the file labeled A between 4 and 50, rotated by 180°.
For example:
pdftk A=file.pdf B=file2.pdf A1 A B output out.pdf
To perform a simple merge of PDF files in the manner of pdftk, the following steps might be performed:
Read each file into memory and create a graph of PDF objects, possibly lazily (i.e., parsing objects on demand, since not all of them will be needed if only certain pages are included).
Renumber the objects in the object graphs so that their numbers are mutually exclusive, i.e., 1...p, p+1...q, q+1...r, etc.
Put all these PDF objects into a new object graph.
Create a new page tree, containing the required combination of page objects from the original files.
Create a new trailer dictionary and root object, linking to the new page tree.
Write the new document to a file.
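The renumbering step can be sketched in Python using a toy model of the object graph. This is an assumption-laden illustration, not how pdftk is implemented: each document is a dict from object number to a nested Python value, indirect references are instances of a Ref class we define here, and generation numbers are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Toy stand-in for an indirect reference such as `4 0 R`."""
    num: int


def shift_refs(value, offset):
    """Return a copy of `value` with every reference moved by `offset`."""
    if isinstance(value, Ref):
        return Ref(value.num + offset)
    if isinstance(value, dict):
        return {k: shift_refs(v, offset) for k, v in value.items()}
    if isinstance(value, list):
        return [shift_refs(v, offset) for v in value]
    return value


def merge_object_graphs(docs):
    """Renumber each document into its own range and combine them."""
    merged = {}
    offset = 0
    for doc in docs:
        for num, obj in doc.items():
            merged[num + offset] = shift_refs(obj, offset)
        # The next document starts after this one's highest number.
        offset += max(doc, default=0)
    return merged


doc_a = {1: {"Type": "Pages", "Kids": [Ref(2)]}, 2: {"Type": "Page", "Parent": Ref(1)}}
doc_b = {1: {"Type": "Pages", "Kids": [Ref(2)]}, 2: {"Type": "Page", "Parent": Ref(1)}}
merged = merge_object_graphs([doc_a, doc_b])
print(sorted(merged))      # [1, 2, 3, 4]
print(merged[3]["Kids"])   # [Ref(num=4)]
```

Building the new page tree and root object would then be a matter of collecting the renumbered page objects and pointing a fresh /Pages node at them.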
A fully functioning merge would also need to:
Trim references to pages no longer in the document due to the use of a page range. Were this not done, a single reference to a page which is not in the output can result in the inclusion of all of the objects from that page, bloating the output.
Remove duplicate font definitions. Often, files to be merged come from the same source, and share content like fonts. These can be deduplicated to save space.
Combine the other parts of the file—bookmarks, destinations, forms and so on. Generally speaking, data which is strictly per-page survives automatically, but document-wide data needs specific merging support.
Decide where to take metadata and PDF version numbers from (for example, using the highest PDF version number amongst the inputs and taking the metadata from the first given file).
To take a selection of pages from a document, we use the same syntax as for merging, because our operation is equivalent to merging just one file with a page range:
pdftk file1.pdf 2-20 output out.pdf
This writes pages 2-20 inclusive to the output file. Pdftk has a separate facility for splitting a file into individual pages and writing them all to disk at once, using the burst operation.
pdftk input.pdf burst
By default, this writes the pages to pg_0001.pdf, pg_0002.pdf etc. To write them with
differently-formatted names, an output string in the style of the built-in
C function printf may be provided. For
example:
pdftk input.pdf burst output page%03d.pdf
would create page001.pdf, page002.pdf etc.
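For readers unfamiliar with printf-style patterns, the following Python sketch shows how such a pattern expands into filenames. Python's % operator follows the same conventions for %d; the helper name burst_names is our own and is not part of pdftk.

```python
def burst_names(pattern: str, page_count: int) -> list:
    """Expand a printf-style pattern for pages 1..page_count."""
    return [pattern % page for page in range(1, page_count + 1)]


print(burst_names("page%03d.pdf", 3))
# ['page001.pdf', 'page002.pdf', 'page003.pdf']
print(burst_names("pg_%04d.pdf", 2))
# ['pg_0001.pdf', 'pg_0002.pdf']
```

Here %03d means a decimal number padded with zeros to at least three digits.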
The burst operation also writes the document’s metadata to the file doc_data.txt. We consider this functionality in Extracting and Setting Metadata.
In order to split a PDF into several parts of one or more pages each, a program such as pdftk would take the following steps:
Load and parse the input document into an object graph, possibly lazily (so that pages which aren’t going to appear in any of the output don’t have to be processed).
Create a new, empty PDF data structure for each new document. Create a new page tree for each page range, using the same object numbers as the existing document.
Copy all the objects from the input PDF into each output PDF.
Remove all objects not required in each PDF (i.e., ones which are no longer referenced).
To perform the last step correctly, it is important to process bookmarks, destinations, and other cross-page objects to remove references to pages which no longer appear in a given output file, since a single errant reference could result in a source file’s whole object graph being included, even though none of it is required.
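The final step, and the danger of an errant reference, can be illustrated with a toy object graph in Python. As a stated assumption, objects are nested Python values, indirect references are instances of a Ref class defined here, and the function names are our own; a real implementation would work on parsed PDF objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Toy stand-in for an indirect reference such as `4 0 R`."""
    num: int


def references(value):
    """Yield every object number referenced inside a nested value."""
    if isinstance(value, Ref):
        yield value.num
    elif isinstance(value, dict):
        for item in value.values():
            yield from references(item)
    elif isinstance(value, list):
        for item in value:
            yield from references(item)


def prune(objects, root):
    """Keep only the objects reachable from the root object."""
    seen = set()
    stack = [root]
    while stack:
        num = stack.pop()
        if num in seen or num not in objects:
            continue
        seen.add(num)
        stack.extend(references(objects[num]))
    return {num: obj for num, obj in objects.items() if num in seen}


objects = {
    1: {"Type": "Catalog", "Pages": Ref(2), "Outlines": Ref(5)},
    2: {"Type": "Pages", "Kids": [Ref(3)]},          # page 4 already dropped
    3: {"Type": "Page", "Parent": Ref(2)},
    4: {"Type": "Page", "Parent": Ref(2), "Contents": Ref(6)},
    5: {"Title": "Page two", "Dest": [Ref(4), "Fit"]},  # bookmark to page 4
    6: "content stream of the dropped page",
}
print(sorted(prune(objects, 1)))  # [1, 2, 3, 4, 5, 6]

objects[5] = {"Title": "Page two"}  # destination removed first
print(sorted(prune(objects, 1)))  # [1, 2, 3, 5]
```

In the first call the bookmark's destination keeps the dropped page and its content stream alive; once the destination is removed, both disappear from the output.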
A stamp is a PDF page placed over another so that the page contents are combined. A watermark (which pdftk calls a background) is the same, but the stamp is placed under the existing page contents. This doesn’t work well if the pages of the input PDF have a colored background, since the watermark often won’t show through.
With pdftk, this is achieved
using the stamp and background operations, which place the stamp on
(or under) all the pages in the given range. If the page sizes differ, the
stamp is scaled to fit and centered.
For example:
pdftk file.pdf stamp stamp.pdf output output.pdf
When a program like pdftk adds a stamp to an input PDF, the following steps must be taken:
Load and parse both files into PDF object graphs.
Rectify the object numbers in both PDFs so that they are mutually exclusive. The objects from the stamp PDF may now be added to the input PDF.
The page data for the stamp is appropriately scaled and centered with relation to the page size of each page in the source PDF.
The page data for the stamp is appended to the page data for
the source PDF on each page. Resources like fonts and images must all be
renamed so as not to clash. Any unmatched stack operators (q/Q)
must be matched up prior to adding the new data.
The PDF can now be written to the output file.
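The scaling and centering step amounts to computing a transformation matrix for the cm operator. The Python sketch below shows one plausible way to do it; it assumes uniform scaling, that both page boxes have their origin at (0, 0), and that the original page content is itself balanced with respect to q/Q. The function names are our own, and pdftk's exact behavior may differ.

```python
def fit_matrix(stamp_w, stamp_h, page_w, page_h):
    """Matrix (a b c d e f) scaling the stamp to fit and centering it."""
    scale = min(page_w / stamp_w, page_h / stamp_h)  # keep the aspect ratio
    tx = (page_w - stamp_w * scale) / 2              # centre horizontally
    ty = (page_h - stamp_h * scale) / 2              # centre vertically
    return (scale, 0.0, 0.0, scale, tx, ty)


def wrap_stamp(page_ops: str, stamp_ops: str, matrix) -> str:
    """Append stamp operators to page operators, isolating graphics state."""
    a, b, c, d, e, f = matrix
    return (
        f"q\n{page_ops}\nQ\n"                                   # original page
        f"q\n{a:g} {b:g} {c:g} {d:g} {e:g} {f:g} cm\n{stamp_ops}\nQ\n"  # stamp
    )


matrix = fit_matrix(200, 100, 400, 400)
print(matrix)  # (2.0, 0.0, 0.0, 2.0, 0.0, 100.0)
print(wrap_stamp("BT ... ET", "0 0 200 100 re S", matrix))
```

Wrapping each part in its own q/Q pair means that any graphics state left behind by the page content cannot affect where the stamp is drawn.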
Pdftk can extract a document’s
metadata (author, title etc.) to a text file, either in ASCII format (with
non-ASCII characters encoded as XML-style numerical entities) or as
Unicode UTF-8. This is achieved with the dump_data or dump_data_utf8 keywords. For example:
pdftk input.pdf dump_data output data.txt
writes the data in Example 9-1 to data.txt.
Example 9-1. Example output of pdftk dump_data operation (ellipses indicate where we have truncated the output for brevity)
InfoKey: Creator
InfoValue: XSL Formatter V4.3 R1 (4,3,2008,0424) for Linux
InfoKey: Title
InfoValue: PDF Explained
InfoKey: Producer
InfoValue: Antenna House PDF Output Library 2.6.0 (Linux)
InfoKey: ModDate
InfoValue: D:20110713115225-05'00'
InfoKey: CreationDate
InfoValue: D:20110713115225-05'00'
PdfID0: 57f4673abea4ca58a27e19bf1871dfa
PdfID1: 57f4673abea4ca58a27e19bf1871dfa
NumberOfPages: 90
...
BookmarkTitle: Table of Contents
BookmarkLevel: 1
BookmarkPageNumber: 5
BookmarkTitle: Preface
BookmarkLevel: 1
BookmarkPageNumber: 9
BookmarkTitle: Why Read This Book?
BookmarkLevel: 2
BookmarkPageNumber: 9
BookmarkTitle: Audience
BookmarkLevel: 2
BookmarkPageNumber: 9
...
PageLabelNewIndex: 1
PageLabelStart: 1
PageLabelNumStyle: DecimalArabicNumerals
PageLabelNewIndex: 5
PageLabelStart: 5
PageLabelNumStyle: LowercaseRomanNumerals
PageLabelNewIndex: 13
PageLabelStart: 1
PageLabelNumStyle: DecimalArabicNumerals
This data lists:
Values and keys from the document information dictionary
The number of pages in the document
The bookmark titles, levels, and destination pages
The page labels
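Because the dump is plain text with one "Name: value" pair per line, it is easy to process with a script. The Python sketch below collects the InfoKey/InfoValue pairs into a dictionary; the function name parse_info is our own, and the sketch assumes each value fits on a single line, as in Example 9-1.

```python
def parse_info(dump: str) -> dict:
    """Collect InfoKey/InfoValue pairs from dump_data output."""
    info = {}
    key = None
    for line in dump.splitlines():
        name, sep, value = line.partition(": ")
        if not sep:
            continue  # skip lines such as "..." which carry no pair
        if name == "InfoKey":
            key = value
        elif name == "InfoValue" and key is not None:
            info[key] = value
            key = None
    return info


sample = """InfoKey: Title
InfoValue: PDF Explained
InfoKey: ModDate
InfoValue: D:20110713115225-05'00'
NumberOfPages: 90
"""
print(parse_info(sample))
# {'Title': 'PDF Explained', 'ModDate': "D:20110713115225-05'00'"}
```

The same line-by-line approach extends to the bookmark and page label records, which arrive as groups of consecutive lines.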
The update_info operation can be
used to perform the reverse: to set the information listed above. There is
also a corresponding update_info_utf8
operation. For example, we can modify the data.txt file we created and then use update_info:
pdftk input.pdf update_info data.txt output output.pdf
PDF files can have attachments added at the document or page level. The technical foundations of PDF attachments are discussed in Chapter 7. To add an attachment at the file level:
pdftk input.pdf attach_files file1.xls file2.xls output output.pdf
The attachment is added to the end of the list of file-level
attachments. To add an attachment at the page level, use the to_page keyword:
pdftk input.pdf attach_files file1.xls to_page 4 output output.pdf
To extract the attachments from a document, writing them to a given
directory, we can use the unpack_files
keyword:
pdftk input.pdf unpack_files output outputs/
This writes the attachments, under their original filenames, in the outputs directory.
Pdftk has facilities for reading encrypted files, and for encrypting the output file.
The input_pw keyword can be
used to specify owner passwords for the input file(s). The passwords are
associated with the inputs by using handles, as with page ranges. If no
handles are given, the passwords are assumed to be given in the same
order as the input files. If the user password is given instead, most
pdftk features will not be available,
because the PDF security model would prevent it.
For example, to merge two files which are encrypted, the passwords can be provided thus:
pdftk file1.pdf file2.pdf input_pw fred charles output out.pdf
Here, “fred” is
the password for file1.pdf,
“charles” the password
for file2.pdf.
Pdftk can encrypt the output
with the 40-bit or 128-bit RC4 encryption methods, using the encrypt_40bit and encrypt_128bit keywords. We can specify the
owner and user passwords using the owner_pw and user_pw keywords. For example, to encrypt a
file with 128-bit encryption using an owner password, but a blank user
password:
pdftk input.pdf output output.pdf encrypt_128bit owner_pw fred
Notice we leave out the user_pw
keyword to indicate a blank user password.
We have not yet specified the operations to be allowed when the
user password is entered. This can be done by using the allow keyword with one or more of the
permissions (corresponding to those enumerated in Chapter 8):
Printing |
DegradedPrinting |
ModifyContents |
Assembly |
CopyContents |
ScreenReaders |
ModifyAnnotations |
FillIn |
AllFeatures (all of the above, plus high-quality printing) |
For example, to allow form filling, but nothing else:
pdftk input.pdf output output.pdf encrypt_128bit allow FillIn owner_pw fred
In order to view or edit page-level content like streams of graphics
operators, it is necessary first to remove the compression used for the
data stream. This can be achieved with the pdftk uncompress modifier:
pdftk compressed.pdf output uncompressed.pdf uncompress
The process can be reversed (following manual editing, for example)
by using compress
instead:
pdftk uncompressed.pdf output compressed.pdf compress
In this chapter we list and describe software for viewing, converting, editing, and programming with PDF files. We consider both open source software, and zero-cost commercial software where it is provided by Adobe or operating system manufacturers. There is a large variety of commercial software from third parties, which we do not discuss here.
We also list sources of further documentation and information.
The job of a PDF viewer is to:
Display the graphical and textual content of the document.
Allow the user to interact with the document using bookmarks and hyperlinks.
Enable searching of the textual content, and extraction of text via cut and paste.
Not every viewer has all of these features. Due to the huge complexity of the PDF format and the formats it encapsulates (for example, fonts and images), performance can vary significantly—especially on files using more modern PDF features.
Adobe Reader is Adobe’s own, free PDF viewer and the only one guaranteed to support the various proprietary extensions Adobe has made to PDF (for example, the more modern kinds of forms and annotations). It comes with a PDF plug-in for common web browsers, and is available for Microsoft Windows, Mac OS X, Linux, Solaris, and Android. It allows forms to be filled in and submitted electronically.
Adobe Reader can be found at Adobe’s website.
Many Mac OS X users prefer the fast, simple PDF viewer Preview, provided with the operating system. It launches more quickly, and is smoother in use than Adobe Reader, with good support for searching and extracting text. Quick launching is especially important when the PDF viewer is loaded within a web browser window as a plug-in. Typically, Acrobat Reader is also installed for the occasions when Preview doesn’t support a file (for example, a fillable form with JavaScript for a tax return).
In addition, Preview has limited (but increasing) editing capabilities, described in Editing with Preview on Mac OS X.
Xpdf is a small, fast, open source PDF viewer, running on virtually any Unix-like computer where the X Window System is available. Support for advanced PDF facilities is limited, but it is a highly reliable program for files within its capabilities.
Xpdf can be found at Foo Labs’ website.
GSview is an open source PDF and PostScript viewer for Microsoft Windows and Unix. It is based on the venerable and highly reliable GhostScript PDF and PostScript interpreter.
GSview and GhostScript (which is required by GSview) can be downloaded from the GhostScript website.
Adobe provides its own expensive, commercially-licensed library for PDF manipulation, based on the same code as Acrobat itself. In this section, we consider popular open source alternatives.
In general, it’s much easier to build libraries to write PDF files than to read them. To write a file, one need only understand the small subset of PDF required for a particular application (i.e., one compression mechanism, one font type etc.) and no complicated parsing mechanisms. To read a file, one must implement the whole standard.
iText is a mature open source library for reading and writing PDF documents, and for making textual reports using high-level building blocks such as paragraphs, lists, tables, and images. It also has support for building bookmarks, hyperlinks, annotations, and JavaScript actions. Fillable forms can be constructed, and encrypted files are supported.
iText can be downloaded from the iText Software website.
TCPDF is a pure PHP library for the generation of PDF reports, including text layout, tables, conversion of HTML, annotations, hyperlinks, and images. Web services can use TCPDF to build a document dynamically and serve it to a PDF viewer running within a web browser, or send it by email.
TCPDF can be downloaded, together with a wide range of examples, from its website.
There are a large number of PDF libraries for reading, writing, and editing PDF files in Perl, some of which are highly mature, others less so. Documentation is often sparse, belying the extensive capabilities available.
As with all free Perl modules, the Comprehensive Perl Archive Network holds both source code and documentation.
Apple’s PDFKit provides a number of classes for use with Apple’s supported programming languages (such as Objective C). These include:
PDFView, an onscreen view on a PDF document.
PDFDocument and PDFPage for document and page-level manipulation.
PDFAnnotation, PDFAction, PDFOutline, and PDFSelection for interactive facilities.
Apple’s built-in PDF viewer, Preview, is built on these libraries. The PDF Kit Libraries are documented in Apple’s Mac OS X Developer Library.
Format conversions come in three categories:
Converting to or from a similar, scalable vector format (e.g., PostScript or SVG). In this case, structural information is often preserved well.
Converting from a PDF to a raster image, such as a PNG or TIFF.
Converting from a raster image to a PDF, which often just involves simple encapsulation, especially in the case of formats PDF knows about, like JPEG.
The pdf2ps and ps2pdf command-line programs which ship with GhostScript can convert between PDF and PostScript. Sometimes this involves quite complicated and slow processing, which may lead to larger file sizes or the loss of some constructs (for example, text being converted to outlines). PDF and PostScript are, after all, very different, despite a shared heritage.
ps2pdf and pdf2ps are available from the GhostScript home page.
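As a small illustration, the following Python sketch picks between the two programs based on file extensions and runs the chosen one. The helper names `conversion_command` and `convert` are hypothetical, and the sketch assumes the simplest invocation, with the input file followed by the output file and no further options.

```python
import shutil
import subprocess


def conversion_command(src: str, dst: str) -> list:
    """Choose pdf2ps or ps2pdf from the extensions of src and dst."""
    s, d = src.lower(), dst.lower()
    if s.endswith(".pdf") and d.endswith(".ps"):
        return ["pdf2ps", src, dst]
    if s.endswith(".ps") and d.endswith(".pdf"):
        return ["ps2pdf", src, dst]
    raise ValueError("expected a .pdf to .ps or a .ps to .pdf conversion")


def convert(src: str, dst: str) -> None:
    """Run the conversion, failing clearly if the program is not installed."""
    cmd = conversion_command(src, dst)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found; is GhostScript installed?")
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
```

For example, `convert("report.ps", "report.pdf")` would run `ps2pdf report.ps report.pdf`.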
The gs program which comes with GhostScript can be used to render a PDF page to a raster image at a given resolution, suitable for printing or for onscreen use. This is achieved by specifying one of several special output devices which correspond to image file formats, such as PNG and TIFF. GSView uses the same facility to display PDF pages.
gs is part of the GhostScript system, available from the GhostScript home page.
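The sketch below shows one way such a rendering command might be assembled in Python. The function name `render_command` and its defaults are assumptions made for illustration; the options used (`-sDEVICE`, `-r`, `-sOutputFile`, `-dBATCH`, `-dNOPAUSE`, `-dSAFER`, `-q`) are standard gs switches, and `png16m` is the 24-bit color PNG device.

```python
import shutil
import subprocess


def render_command(pdf_path: str, out_pattern: str = "page-%03d.png",
                   device: str = "png16m", dpi: int = 150) -> list:
    """Build a gs command that renders every page of pdf_path to an image."""
    return [
        "gs",
        "-dSAFER", "-dBATCH", "-dNOPAUSE", "-q",  # run non-interactively
        f"-sDEVICE={device}",                      # output device, e.g. png16m
        f"-r{dpi}",                                # resolution in dots per inch
        f"-sOutputFile={out_pattern}",             # %03d becomes the page number
        pdf_path,
    ]


def render(pdf_path: str, **options) -> None:
    """Run the rendering command if gs is installed."""
    if shutil.which("gs") is None:
        raise FileNotFoundError("gs not found; is GhostScript installed?")
    subprocess.run(render_command(pdf_path, **options), check=True)
```

Choosing a different device, for instance `render("in.pdf", device="tiff24nc", out_pattern="page-%03d.tif")`, switches the output format without changing anything else.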
Most modern word processors have the facility to export as PDF, maintaining hyperlinks and building bookmarks for the table of contents. However, it is often necessary to produce PDF output from programs which do not have the facility to convert their native format to PDF. This can be achieved by the use of a printer driver which writes the PDF to a file, instead of printing it.
Mac OS X provides this facility natively, through the “Save as PDF” option in the print dialog.
On Unix platforms, this facility is provided by the open source CUPS-PDF backend to the CUPS printing system.
On Microsoft Windows, the open source PDFCreator printer driver does the same job. It uses GhostScript internally.
PDFs were not originally intended to be edited significantly, but to serve as a scalable, structured end-format for publishing. Thus, most editing software has restricted and specific editing functions such as merging files, adding annotations, filling in forms, or making small edits to page content.
In Chapter 9 we looked at pdftk, an open source program for command-line manipulation of PDF files. In this section, we list other ways of editing existing PDF files.
Adobe’s own PDF editor, Acrobat (which costs several hundred dollars), has a wide range of functionality, over and above that of the free Adobe Reader. This includes:
Printing to PDF, and conversion from PostScript to PDF.
Conversion to and from Microsoft Word and Excel.
Optical Character Recognition (OCR), producing a PDF file which looks exactly like the scanned document, but has searchable, editable text.
Reordering, rotating, and editing pages and contents.
Preflight and print publishing tools.
Building PDF forms.
Creating and validating PDF/A and PDF/X.
Adding encryption and digital signatures.
There are many commercial third-party plug-ins available for Adobe Acrobat, providing extra functionality.
Preview, the standard PDF viewing program on Mac OS X, also has editing facilities, which tend to be underused since they are not prominent in the interface.
Preview can annotate PDF documents, highlight and strike through text, crop pages, add text, add hyperlinks, delete and rearrange pages, and merge PDFs.
Preview deals with a wide range of documents, and manages to preserve functionality it doesn’t understand when editing other aspects of the file.
This book was written to fill a conspicuous gap in PDF literature. Here, we list other sources of information and documentation.
The PDF Reference Manual was published as a book until PDF version 1.6. Now, alas (but perhaps fittingly, given its subject matter), it is only available as a PDF document.
PDF version 1.7 was ratified as an ISO Standard in 2008 (Standard number ISO 32000-1:2008). The ISO charges almost 500 US Dollars for a PDF copy (by download, or on CD-ROM). Luckily, Adobe continues to provide the PDF Version 1.7 Reference electronically. This is an approved copy of ISO 32000-1:2008. In particular, the chapter, section, and subsection numbers are identical.
More recent Adobe extensions to PDF 1.7 are documented in ExtensionLevel documents, which do not form part of the ISO Standard, but would be expected to form part of a later, updated one.
Both Adobe’s copy of ISO 32000-1:2008 and the ExtensionLevel documents can be downloaded from the Adobe Developer Connection Website.
O’Reilly’s other PDF title, PDF Hacks by Sid Steward, emphasizes practical solutions to a wide range of PDF problems. It includes 100 separate hacks to:
Customize PDF viewers to make reading PDFs more comfortable.
“Refry” huge PDF files into much smaller files.
Create PDF files with a variety of tools on a number of platforms.
Edit PDF text from the gVim text editor.
Use familiar software to create PDFs with advanced navigation features.
Build PDFs with sophisticated navigation and interactive features.
Generate PDFs on the fly.
Integrate PDF files with websites beyond a simple hyperlink.
Collect data on a website with PDF forms.
Index and compare PDF files.
Convert incoming faxes to PDF.
Write scripts that control Adobe Acrobat.
The PDF standard and this book make reference to (and sometimes assume knowledge of) the general area of computer graphics. The standard reference for these topics is Computer Graphics: Principles and Practice (Foley et al., Addison-Wesley, 1990). This book contains all the background on Bézier curves, transparency, affine transformations, and other topics needed to understand how to write PDF graphics streams.
A good reference for understanding the dictionaries, trees, and other data structures in PDF and why they were chosen is Introduction to Algorithms (Cormen et al., MIT Press, 1990). Any similar book on algorithms should suffice.
There are a number of places to discuss technical PDF topics:
The Planet PDF Forums are a popular venue for all sorts of technical and nontechnical PDF discussions.
Adobe’s Adobe Reader Forums, for technical support and discussion of Adobe Reader.
The comp.text.pdf Usenet newsgroup is a low-traffic place for more technical discussions.
There are two relevant sections of the Adobe website for those interested in the technical aspects of PDF:
The PDF Technology Center contains PDF reference documents.
The Acrobat Developer Center has resources and documentation for writing Acrobat plug-ins, the FDF forms format, and a developer knowledge base.
| Revision History | |
|---|---|
| 2011-11-30 | First release |
Copyright © 2011 John Whitington
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: (800) 998-9938 or corporate@oreilly.com.
Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. PDF Explained, the image of a lesser anteater, and related trade dress are trademarks of O’Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps.
1005 Gravenstein Highway North
Sebastopol, CA 95472